Deep web miner

ABSTRACT

Systems, computer implemented methods and computer program products are provided for selectively capturing and/or evaluating information including content and metadata from across a network such as the World Wide Web (WWW), or more generally, the Internet. A deep web mining tool may be utilized to exploit the deep web by understanding forms, search engines and results pages. Moreover, the deep web mining tool may be utilized to extract and exploit structured and unstructured content and metadata from web sites and documents, generate queries, capture and re-link web sites, crawl through web sites and non-HTML files and perform other aspects of obtaining and/or evaluating information.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application Ser. No. 61/027,718 filed Feb. 11, 2008 entitled “Deep Web Miner”, the disclosure of which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

The present invention relates to tools for selectively capturing network accessible information including content and metadata.

The Internet, including the World Wide Web, is a source of vast quantities of data. In this regard, traditional search engines attempt to locate and index this data in order to respond with relevant results to user-initiated queries. However, conventional search engines are extremely limited in their results. For example, the content on the Internet may be characterized as “surface web” content, which traditional search engines can index, and “deep web” content, which search engines typically cannot index.

Deep web content includes, for example, information in private databases, information that is retrievable only as a result of executing a query or processing an on-line form, unlinked content, information stored in private or otherwise secure network locations, scripted content, non-hypertext markup language (HTML) files such as images, video, audio, Portable Document Format (PDF) files, executable files and other types of content that are not otherwise accessible to be crawled by conventional search engines.

Moreover, it is estimated that the deep web comprises a significant portion of the content associated with the Internet. Accordingly, it is likely that a substantial amount of information that may be relevant to a query topic is inaccessible to traditional search engines as they typically do not crawl or otherwise index the deep web.

BRIEF SUMMARY OF THE INVENTION

According to aspects of the present invention, systems, methods and computer program products are provided for extracting information from a network by obtaining seed information from a user and by identifying a search engine to utilize for performing deep web mining. The seed information provided by the user is mapped to query terms suitable for use with the identified search engine. Once the query terms have been mapped, an iterative mining process is performed by retrieving a query page having a form for accessing the search engine and by simulating entry of the form to automatically submit a query to the search engine based at least in part upon the derived query terms.

Addresses of interest are identified from the query results and the network is crawled to obtain content and/or metadata from the identified addresses of interest. Moreover, a local, navigable copy of the content obtained from crawling the network may be built at a local storage device. Still further, the resulting content returned from the crawlers is analyzed to generate new content-based query terms, which are used to submit new queries to the search engine as part of the iterative process.

According to further aspects of the present invention, a computer program product is provided for performing deep web mining operations. The computer program product includes a computer usable medium having computer usable program code embodied therewith. The computer usable program code comprises computer usable program code configured to define a new task corresponding to a concept space associated with a topic of interest to a user. The computer usable program code also comprises computer usable program code configured to obtain seed information with regard to the concept space including identifying at least one of an on-line form and at least one search term.

Still further, the computer program product comprises computer usable program code configured to create at least one deep mining thread associated with the defined new task, wherein the deep web mining thread performs a mining process. To implement the mining process, the computer program product comprises computer usable program code configured to define a plurality of content-service threads and crawler threads. Computer usable program code is also configured to generate at least one query derived from keyword information within the corresponding task and/or terms obtained from analysis of crawled content and computer usable program code configured to queue the generated queries.

To implement the mining process, the computer program product further comprises computer usable program code configured to declare a specific implementation of an abstract forms-based query service in a corresponding content-service thread that executes a deep mining process by matching an identified on-line form to a corresponding form-understanding plug-in that understands the format of the on-line form, wherein the selected form-understanding plug-in simulates the submission of a query and identifies relevant result addresses.

The mining process is further implemented by computer usable program code configured to queue query result addresses in a crawler queue, computer usable program code configured to asynchronously service each result address by a corresponding crawler thread that obtains content and/or metadata that is cached in a local storage medium and computer usable program code configured to process the content of the returned results. Still further, computer usable program code is configured to update a display with a listing of the mined results, wherein the user may browse a local navigable copy of the crawled results in isolation by selecting a navigable entry of the listing.

The computer program product may also optionally include computer usable program code that enables a user to build a form-understanding plug-in that is usable by the computer usable program code configured to declare a specific implementation of an abstract forms-based query service in a corresponding content-service thread. In this regard, the computer program product may further comprise computer usable program code configured to obtain a web site of interest, computer usable program code configured to retrieve a query page having a form for accessing the site's search engine, computer usable program code configured to recognize or obtain relevant form input(s), and computer usable program code configured to generate or obtain example search term(s).

The computer usable program code that enables a user to build a form-understanding plug-in further comprises computer usable program code configured to simulate entry of the form to submit a query to the search engine based on the example query term(s) and computer usable program code configured to receive query results returned in response to submitting the query form to the search engine, the query results comprising at least one page of addresses to locations on the network having content responsive to the submitted query.

The computer usable program code that enables a user to build a form-understanding plug-in further comprises computer usable program code configured to recognize or obtain result anchors of interest within the query results, computer usable program code configured to derive a pattern that distinguishes result anchors from non-result anchors, computer usable program code configured to recognize or obtain next page anchors of interest within the query results, computer usable program code configured to derive a pattern that distinguishes next page anchors from other anchors and computer usable program code configured for persisting the resulting form-understanding plug-in for subsequent use by the deep web miner.

According to further aspects of the present invention, a method of extracting information from a network comprises executing a user interface on a computer for obtaining seed information from a user, where the seed information provides sufficient information to define a concept of interest to the user, identifying a search engine to utilize for performing deep web mining, mapping the seed information provided by the user to query terms suitable for use with the identified search engine and performing an iterative mining process until a stopping event is detected.

The iterative mining process may be performed by retrieving a query page having a form for accessing the search engine, simulating entry of the form to submit a query to the search engine based at least in part upon the derived query terms and receiving query results returned in response to submitting the query form to the search engine, the query results comprising at least one page of addresses to locations on the network having content responsive to the submitted query and identifying addresses of interest from the query results for further processing.

The iterative mining process may further be performed by crawling the network to obtain content from the identified addresses of interest and building a local navigable copy of the content obtained from crawling the network in a local storage device. In this regard, links within the content of the local navigable copy may be limited to the local copy itself and may not function if the link contents were not captured by the corresponding mining process.

The iterative mining process may further be performed by analyzing the resulting content returned from crawling the network, generating at least one new content-based query term based upon analyzing the search results, updating the query terms based upon at least one new content-based query term, dynamically conveying the results of processing to the user such that the user can interact with a dynamically changing local navigable environment while the mining process is iterating and dynamically reconfiguring the iterative mining process based upon user interaction, while the mining process is iterating.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The following detailed description of various aspects of the present invention can be best understood when read in conjunction with the following drawings, where like structure is indicated with like reference numerals, and in which:

FIG. 1 is a block diagram of a system including a deep web miner for capturing network accessible content and metadata according to various aspects of the present invention;

FIG. 2 is an illustration showing the deep web miner of FIG. 1 interacting with both the surface web and deep web aspects of the Internet according to various aspects of the present invention;

FIG. 3 is a flowchart illustrating a deep web mining process according to various aspects of the present invention;

FIG. 4 is a block diagram of an implementation of the deep web miner according to various aspects of the present invention;

FIG. 5 is a block diagram of nested operations performed by the deep web miner according to various aspects of the present invention;

FIGS. 6-14 are screen shots of illustrative user interface screens for initiating a deep web mining process according to various aspects of the present invention;

FIG. 15 is an illustration of an exemplary search engine form accessed by a form-understanding plug-in of the deep web miner according to various aspects of the present invention;

FIG. 16 is an illustration of the deep web miner automatically filling out the exemplary search engine form of FIG. 15 based upon a user initiated search criteria, according to various aspects of the present invention;

FIG. 17 is an illustration of an exemplary search engine results page returned to the deep web miner in response to the search of FIG. 16;

FIGS. 18A and 18B are block diagrams of select components defining an implementation of a deep web miner according to various aspects of the present invention;

FIG. 19 is a table illustrating exemplary processors the deep web miner may implement according to various aspects of the present invention;

FIG. 20A is a graph showing information about possible query results from a single search term;

FIG. 20B is a graph showing information about possible query results from a paired query;

FIG. 20C is a graph showing information about possible query results from a chained query;

FIG. 21 is a block diagram of a component for training and/or building a form-understanding plug-in according to various aspects of the present invention;

FIG. 22 is a screen shot illustrating an exemplary implementation of the component of FIG. 21, according to various aspects of the present invention; and

FIG. 23 is a block diagram of an exemplary computer system including a computer usable medium having computer usable program code embodied therewith, where the exemplary computer system is capable of executing a computer program product to provide deep web mining according to various aspects of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

According to various aspects of the present invention, systems, computer implemented methods and computer program products are provided for selectively capturing and/or evaluating information including content and metadata from across a network such as the World Wide Web (WWW), or more generally, the Internet.

As will be described more fully herein, a deep web mining tool may be utilized to exploit the deep web by understanding forms, search engines and results pages. Moreover, the deep web mining tool may be utilized to extract and exploit structured and unstructured content and metadata from web sites and documents, generate queries, capture and re-link web sites, crawl through web sites and non-HTML files and perform other aspects of obtaining and/or evaluating information. The deep web mining tool may be further utilized to output HTML files and supporting media, such as PDF files, text files, images, style sheets, scripts, movies, audio files, etc., to create a local navigable copy of mined content as will be described in greater detail herein. Moreover, the deep web mining tool may be utilized to output Extensible Markup Language (XML) containing metadata such as Uniform Resource Locators (URLs), text content and query terms used for mining processes, etc.

Referring now to the drawings and particularly to FIG. 1, a general diagram of a computer system 100 is illustrated. The computer system 100 comprises a plurality of hardware and/or software processing devices, designated generally by the reference 102, that are linked together by a network 104. Typical processing devices 102 may include personal computers, notebook computers, transactional systems, purpose-driven appliances, pervasive computing devices such as a personal data assistant (PDA), palm computers, cellular access processing devices, special purpose computing devices and/or other devices capable of communicating over the network 104.

The network 104 provides communications links between the various processing devices 102, and may be supported by networking components 106 that interconnect the processing devices 102, including for example, routers, hubs, firewalls, network interfaces, wired or wireless communications devices and corresponding interconnections. Moreover, the network 104 may comprise connections using one or more intranets, extranets, local area networks (LAN), wide area networks (WAN), wireless networks (e.g., WIFI, WiMAX), the Internet, including the World Wide Web (WWW), and/or other arrangements for enabling communication between the processing devices 102.

The illustrative system 100 also includes a plurality of processing devices 108, e.g., servers, dedicated networked storage devices and other processing devices that store information in data sources 110. The information stored in the data source(s) 110 may include content utilized to generate HTML pages, structured and unstructured documents, media including images, audio files and/or video files, Flash or other executable program(s), metadata, etc. The system 100 is shown by way of illustration, and not by way of limitation, as a computing environment in which various aspects of the present invention may be practiced.

Conventional web browsers may be executed on the various processing devices 102 to retrieve content from the network 104 by identifying a unique URL that serves as the address for the associated content. For example, the content may be data such as a web page, document, media file, etc., that is maintained within the data source 110 of a corresponding one of the processing devices 108. The web browsers may then update page layouts while asynchronously retrieving additional content and/or performing other similar tasks. The web browsers may also be required to execute scripts or other designated executable code as part of web browsing operations. For example, a web page may utilize a script to interact with one or more servers, pull additional content, and modify itself dynamically. Eventually, a corresponding “web page” is assembled within the corresponding browser.

For purposes of clarity of discussion herein, the term “web page” or simply “page” is used to refer to content that is retrieved, laid out and displayed in one or more browser windows in response to a single request for content. For example, a web page may be generated from a hypertext markup language (HTML) document, a collection of documents, media, executable code, etc. In this regard, a web page may not consist of HTML at all. If a browser executes a script that retrieves additional content within a predetermined time period (Δt), then that retrieved content may be considered part of the web page. However, if the browser executes a script that delays for longer than the predetermined time period (Δt) before returning the content, then such content is not considered as part of the requested web page.
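By way of illustration only, the Δt rule described above may be sketched as follows. The names Fetch and split_by_window, and the representation of each script-driven retrieval as an elapsed time in seconds, are assumptions made for this sketch and do not appear in the figures.

```python
from dataclasses import dataclass


@dataclass
class Fetch:
    """One piece of content retrieved by a script (hypothetical record)."""
    url: str
    seconds_after_request: float  # elapsed time since the original page request


def split_by_window(fetches, delta_t):
    """Partition script-retrieved content using the predetermined period delta_t.

    Content returned within delta_t is treated as part of the web page;
    content returned after a longer delay is not.
    """
    in_page = [f.url for f in fetches if f.seconds_after_request <= delta_t]
    excluded = [f.url for f in fetches if f.seconds_after_request > delta_t]
    return in_page, excluded
```

In such a sketch, the value chosen for delta_t is a tuning decision; the description above only requires that the period be predetermined.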

Moreover, a given “web page” may be a static page such that each visit to that static page returns the same content. Correspondingly, a given web page may be a dynamic page such that each visit to a specific URL may return different content. Thus, if the user requests the same URL again, the browser fetches a new “page”. The content and appearance may be the same as or different from that of the previous URL request, but it is still a new “page”.

As an illustrative example, a user may enter a desired URL into an “address” bar of a web browser executing on a select one of the processing devices 102. The user may alternatively click on a link, select a “favorite”, or utilize any other method supported by the associated web browser for designating the desired URL. The web browser builds and dispatches the request, synchronously retrieves the web page associated with the designated URL, and then asynchronously retrieves all supporting HTML pages and/or other content.

The Deep Web Miner:

According to various aspects of the present invention, a desktop software application referred to herein as a deep web miner 112 defines a tool that is executed on a corresponding one of the processing devices 102 to capture and/or evaluate information including content and metadata that may be located anywhere across the computer system 100.

According to aspects of the present invention, the deep web miner 112 includes a user interface component 114, a mining component 116 and a crawling component 118, which are collectively utilized to mine, crawl and/or otherwise evaluate information obtained from the network 104 as set out more fully herein. For example, the mining component 116 may utilize seed information provided by a user via the interface component 114 to derive query terms or other types of search parameters, perform iterative mining processes (focused data collection) and dynamically convey results to the user. In this regard, mining may be performed by simulating the entry of forms to submit queries to one or more search engines based at least in part upon the derived query terms. As will be described in greater detail herein, the deep web miner 112 may match an identified on-line form to a corresponding “form-understanding plug-in” that understands the format of the on-line form such that the selected form-understanding plug-in simulates the submission of a query and identifies relevant result addresses. The crawling component 118 may correspondingly analyze the results returned from the iterative mining process, e.g., to collect content as will be described in greater detail herein.

Information that is retrieved from the network 104 may be stored in a local storage 122. Also, according to various aspects of the present invention, the deep web miner 112 may build a local navigable copy 124 of mined information retrieved from the network 104, e.g., for analysis by analytical tools. Although illustrated separately for purposes of discussion, the navigable copy 124 may be stored within the local storage 122 or in other practical locations, e.g., on a storage drive associated with the processing device 102, etc. As will be described in greater detail below, a user may interact with the user interface component 114 to configure the deep web miner 112 to broadly mine information or to retrieve a tightly focused collection of strictly relevant documents.

Referring to FIG. 2, the deep web miner 112 is capable of interacting with a “surface web” portion 132 of the Internet as well as a “deep web” portion 134 of the Internet. The surface web portion 132 comprises web sites and web pages that are readily accessible, and which are typically locatable using conventional search engines and/or by providing URL addresses into a navigation control of a conventional web browser as described above. Moreover, the deep web portion 134 comprises content that may be located on intranet sites, private or otherwise secure network locations, document repositories, private databases and other locations typically not crawled by conventional search engines, such as locations that are accessed as a result of executing a query or processing an on-line form. Additionally, deep web content may include unlinked content, script and other executable files, non-hypertext markup language (HTML) files such as images, video, audio, PDF files and other types of content that are not otherwise accessible to be crawled by conventional search engines. Such sources of information are not indexed by traditional search engines and are considered the “deep web” because they are generally hidden from the perspective of a searcher using a conventional search engine.

As will be described in greater detail herein, the deep web miner 112 may automatically enter data into forms and submit form-based requests for information across the network 104. As an illustration, the deep web miner 112 may interact with online forms that follow a common “search engine” pattern. However, the number and types of forms found on the Internet are theoretically limitless. For example, forms may be used to collect usernames and passwords for authentication, collect credit or financial information, support information search and retrieval and perform countless other functions. Depending upon the particular implementation, forms may use recognizable “customary” graphic elements such as text boxes and submit buttons, or they may use non-standard or non-intuitive graphic elements, icons, symbols or other representations. Moreover, form labeling and input may be displayed and accepted in arbitrary languages. Additionally, the positioning of labeling associated with fields within forms may reside in various proximate locations relative to the form field entry point. Still further, some forms, such as international dictionaries or language translation services, accept multiple language input.

Referring to FIG. 3, a method 150 of implementing deep web mining according to aspects of the present invention is illustrated. Seed information is obtained from a user at 152 where the seed information provides sufficient information to define a concept of interest to the user. In this regard, the seed information may specify or otherwise define a “concept space” that will affect a corresponding mining process. For example, a user may provide the deep web miner 112 with seed information by specifying a starting URL, topic(s) of interest, one or more query terms pertaining to the concept of interest, keywords or other significant parameters, etc., before the deep web miner 112 submits requests for information using on-line forms, e.g., to issue a query to a search engine. As will be described in greater detail below, an exemplary approach to obtaining seed information is to provide an abstract search form that is filled out by the user interacting with the user interface component 114 of the deep web miner 112.

The deep web miner 112 identifies a search engine to utilize for performing a mining operation with regard to the concept space derived by the user. For example, the search engine may be selected based upon a starting URL specified with the seed information provided by the user. The search engine may alternatively be selected based upon other factors, e.g., using defaults or otherwise derived criteria. The deep web miner may also map provided seed information to corresponding query terms/search parameters suitable for use with the identified search engine. The deep web miner 112 then retrieves a “query page” of a search engine provided for searching the Internet at 154. The deep web miner then simulates the entry and submission of a query into the query page at 156. The submitted query may utilize one or more of the query terms/search parameters derived from the seed information provided by the user. As will be described in greater detail herein, submission of a query may also be based upon parameters derived from an analysis of previous search results.

According to various aspects of the present invention, the deep web miner 112 utilizes a custom “form-understanding” plug-in to fill out a corresponding on-line form and process the results returned from an issued query to that corresponding on-line form. In this regard, each unique form found on the Internet may utilize a corresponding unique form-understanding plug-in where each plug-in understands the form that it is designed to automatically fill in and submit. Alternatively, a form-understanding plug-in may be generic to one or more forms, as will be described in greater detail herein. Also, one or more plug-ins may be customizable, e.g., by a user or other third party so as to define the parameters that are needed by the deep web miner 112 in order to issue queries to and process results from arbitrary forms or predetermined types of forms. Still further, an extensible plug-in architecture may be utilized such that users and/or developers can expand or add to the capabilities of the plug-ins, such as by providing the capability to add new plug-ins, modify existing plug-ins, delete obsolete plug-ins, etc. Further, although described with reference to plug-ins for convenience of illustration, other approaches may be utilized to convey query terms/search parameters to forms including search engines, etc.

To simulate entry and submission of the query, the deep web miner 112 may thus utilize an appropriate form-understanding plug-in to perform the above-described mapping from the abstract search form provided by the user, e.g., the seed information, to query terms/search parameters formatted to the online form of the specific search engine that the plug-in services. For example, the seed information or otherwise previously determined search terms may not be in a format that is directly compatible with a corresponding form or query syntax. However, the seed information may be converted to properly formatted query terms/search parameters that are further mapped to the appropriate fields of the form to implement a search.
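As a minimal sketch of the mapping described above, and not a definitive implementation, a form-understanding plug-in may be modeled as a small record that knows which field of a particular on-line form carries the search terms. The class name FormPlugin, its attributes, the GET-style submission and the example address are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class FormPlugin:
    """Hypothetical description of one search form, as a plug-in might hold it."""
    action_url: str                       # address the form submits to (assumed GET)
    term_field: str                       # name of the form input that takes the query
    fixed_fields: dict = field(default_factory=dict)  # other inputs with constant values
    joiner: str = " "                     # how this form expects multiple terms combined

    def build_request(self, seed_terms):
        """Map abstract seed terms onto this form's fields and encode the request."""
        params = dict(self.fixed_fields)
        params[self.term_field] = self.joiner.join(seed_terms)
        return f"{self.action_url}?{urlencode(params)}"


# Example: the same seed terms formatted for one assumed form layout.
plugin = FormPlugin("http://search.example/find", "q", {"lang": "en"})
request_url = plugin.build_request(["deep", "web"])
```

A different form would be served by a different FormPlugin instance, so the seed information itself stays independent of any one search engine's syntax.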

The deep web miner then retrieves the “results” page(s) returned from simulating a query and identifies “relevant” result URL(s) from the page for subsequent processing at 158. In this regard, the selected form-understanding plug-in may know how to properly format query terms/search parameters and map them to appropriate fields, submit the query, and extract relevant result URLs from non-result, site specific and other information. For example, the selected plug-in may recognize that banners, advertisements and other information in the returned web page are not search result anchors and are thus not relevant. If more than one result page is available, the deep web miner can obtain additional results pages, such as by simulating the selection of a “next” results control or by utilizing other tools provided on the search results page or by the search engine for navigating results.
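For illustration, separating result anchors from non-result anchors and locating a “next” control may be sketched with the HTMLParser class of the Python standard library. The distinguishing pattern used here (a class attribute value of "result" on result anchors, and an anchor whose text is "next") is an assumption for the sketch; an actual plug-in would carry whatever pattern fits its particular results page.

```python
from html.parser import HTMLParser


class ResultPageParser(HTMLParser):
    """Collects result URLs and the next-page URL from one results page."""

    def __init__(self, result_class="result", next_label="next"):
        super().__init__()
        self.result_class = result_class  # assumed marker of a result anchor
        self.next_label = next_label      # assumed text of the next-page anchor
        self.result_urls = []
        self.next_url = None
        self._href = None                 # href of the anchor currently open
        self._text = []                   # text collected inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        classes = (attrs.get("class") or "").split()
        if self.result_class in classes:
            self.result_urls.append(href)  # banners, ads, etc. lack the marker
        self._href = href
        self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if label == self.next_label:
                self.next_url = self._href
            self._href = None
```

After feeding a results page to the parser, result_urls holds the addresses to hand to the crawlers and next_url, if not None, identifies the following results page.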

As will be described in greater detail herein, according to various aspects of the present invention, the deep web miner uses the plug-ins to retrieve URLs from web page search engines/on-line forms. In this regard, depending upon the on-line form, the user may have no control over what links a search engine will respond with in regard to a corresponding query. For example, public search engines each index and organize their data differently. However, regardless of the manner in which a particular search engine generates its response URLs, the appropriate form-understanding plug-in obtains relevant results and hands this information over to crawlers that return the content at the retrieved URLs. The information returned by the crawlers is then analyzed to generate statistics, which are used to issue subsequent queries. Thus, an iterative process is utilized where new queries are generated based upon crawler generated data. Moreover, statistics may be utilized to decide what pages are worth pursuing and which are not.
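The particular statistics are not specified at this point in the description. As one hedged example, simple term frequency over the crawled text may be used to propose new query terms. The function name propose_terms, the small stop word list and the tie-breaking rule are assumptions for this sketch only.

```python
import re
from collections import Counter

# Small illustrative stop word list; a practical list would be far larger.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "from"}


def propose_terms(documents, used_terms, limit=3):
    """Suggest new query terms from crawled text by raw term frequency.

    Terms already used in earlier queries are skipped so that each
    iteration explores something new.
    """
    counts = Counter()
    for text in documents:
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) > 2 and word not in STOPWORDS and word not in used_terms:
                counts[word] += 1
    # Most frequent first; ties broken alphabetically so the order is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:limit]
```

The same counts could also serve as a crude relevance score when deciding which pages are worth pursuing, although the description leaves the choice of measure open.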

For example, as will be described in greater detail herein, a user may set breadth or depth limits on the deep web miner 112. However, such constraints may be automatically overridden, such as where the system determines that additional pages (breadth or depth) are relevant to the concept space of the user.

The deep web miner 112 then crawls the results to obtain one or more hyperlinked web pages, associated content and/or metadata at 160, which may include structured and/or unstructured documents, files, media, etc. The crawled results may also include, for example, HTTP transactional metadata that is usually hidden by browsers. For instance, based on captured HTTP transactional data, the deep web miner 112 may determine what type and version of HTTP server was used, or when an image was last updated.
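As an illustration of the kind of HTTP transactional metadata mentioned above, the standard Server and Last-Modified response headers can be read from a captured response head. The function name summarize_response_head and the shape of the returned dictionary are assumptions for this sketch; the header parsing relies on the email package of the Python standard library, which handles the "Name: value" header format.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def summarize_response_head(raw_head):
    """Extract server software and last-modified time from a raw HTTP response head."""
    # Normalize line endings, then separate the status line from the headers.
    text = raw_head.replace("\r\n", "\n")
    status_line, _, header_block = text.partition("\n")
    headers = message_from_string(header_block)
    last_modified = headers.get("Last-Modified")
    return {
        "status": status_line,
        "server": headers.get("Server"),  # type and version of the HTTP server, if sent
        "last_modified": (
            parsedate_to_datetime(last_modified) if last_modified else None
        ),
    }
```

Either header may be absent from a given response, in which case the corresponding entry is None.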

There are many possible strategies for capturing online data from a search engine. If the search engine services a domain with a small, limited number of pages, the user may wish to capture every possible page that could be returned by the search engine. That is, the user may not care to narrow the search with topics. Alternatively, the user will probably want to limit the search if the search engine services thousands or millions of domains. Accordingly, the user interface component 114 of the deep web miner 112 may allow the user to specify information related to content retrieval, e.g., by specifying the maximum number of query results that are captured, the maximum hyperlink depth to crawl, etc.

As a few illustrative examples, the user may want the deep web miner 112 to collect the result page that is identified by each query result URL, and nothing more. Alternatively, the user may wish to collect each result page and then explore the pages that are hyperlinked-to by the result page. As such, the deep web miner 112 may support link exploration. The deep web miner 112 may provide one or more options for controlling link exploration. For example, link exploration can be constrained by total number of links, link depth, URL domain, relevance of page content, etc. In this regard, by limiting the number of result pages that are captured, and by controlling subsequent link exploration, the user may define a custom strategy for capturing content.
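The link exploration constraints listed above may be sketched as a breadth-first traversal, by way of example and not limitation. The function explore, its parameter names, and the get_links callable (standing in for whatever component fetches a page and returns its hyperlinks) are hypothetical; relevance of page content is omitted from this sketch.

```python
from collections import deque
from urllib.parse import urlparse


def explore(start_urls, get_links, max_depth=1, max_pages=10, allowed_domains=None):
    """Breadth-first link exploration bounded by depth, page count and domain.

    start_urls are the query result URLs (depth 0). A max_depth of 0 therefore
    captures the result pages and nothing more.
    """
    starts = list(dict.fromkeys(start_urls))   # drop duplicates, keep order
    seen = set(starts)
    queue = deque((url, 0) for url in starts)
    captured = []
    while queue and len(captured) < max_pages:
        url, depth = queue.popleft()
        captured.append(url)
        if depth >= max_depth:
            continue                            # do not follow links any deeper
        for link in get_links(url):
            host = urlparse(link).netloc
            if allowed_domains is not None and host not in allowed_domains:
                continue                        # outside the permitted URL domains
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return captured
```

Setting max_depth to 0 corresponds to collecting each result page and nothing more, while larger values together with max_pages and allowed_domains define a custom capture strategy.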

The deep web mining process may be performed in an iterative manner. That is, the deep web miner can analyze the returned results at 162, such as to derive new query terms/search parameters. These new terms can be utilized to continue to submit new queries and analyze the results therefrom. Based upon the analysis of the search results, new content-based query terms may be generated. The optional generation of new content-based query terms may comprise adding new terms, modifying existing terms, deleting existing terms, etc., if desired by the specific implementation and if possible, e.g., based upon the nature of the returned results. If new terms are generated, those new terms may be used to update the query terms/search parameters for continued iterative processing, e.g., by looping back to 154.

Moreover, the results obtained by deep web mining processes may be dynamically conveyed at 164. For example, the conveyance may comprise building a dynamically changing local copy of the mined data and/or corresponding metadata. By dynamically updating the results of the mining process, e.g., as the information is captured, the user can thus interact with the results for exploration and analysis, even while the mining process continues to iterate, i.e., before the search process itself is complete. In this regard, the local navigable copy may be limited to the extent that links within the navigable copy to network resources that are on the network outside the local navigable copy itself may not function properly. That is, the extent of the navigable copy may be limited to the scope of received search results.

The conveyance may also comprise providing feedback of the search process to the user, such as by updating information on a display device that interacts with the user interface component 114. Various aspects of the method 150 are described in greater detail herein.

A determination is made at 166 as to whether a stopping event has been detected. As will be described in greater detail herein, the stopping event may include a user-imposed link exploration constraint based upon a total number of links, a link depth, a relevance of search results, etc. Moreover, user-defined depth constraints may be overridden if query constraints are satisfied in certain implementations.

Thus, the deep web miner may continue to collect result URLs and/or crawl corresponding results until a stopping event is detected. A stopping event may include detecting that no more URLs are available, detecting a command to stop the deep web miner, detecting a command to issue a new query, etc. If no stopping event is detected, then processing continues as described more fully herein. If a stopping event is detected, then the process is ended at 168.
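The iterate-until-stopped behavior described above can be sketched, under stated assumptions, as a simple loop. In the following Python fragment the callables `search` and `derive_terms` are hypothetical stand-ins for the forms-based query and the content analysis, respectively, and the stopping events modeled are an exhausted term list, an external stop command, and a query limit.

```python
import threading


def mine(seed_terms, search, derive_terms, stop, max_queries=10):
    """Issue queries until no terms remain, a stop is requested, or a limit is hit."""
    pending, issued, results = list(seed_terms), [], []
    while pending and not stop.is_set() and len(issued) < max_queries:
        term = pending.pop(0)
        issued.append(term)
        urls = search(term)
        for url in urls:
            if url not in results:
                results.append(url)
        # Analysis of the results may yield new content-based query terms.
        for new_term in derive_terms(urls):
            if new_term not in issued and new_term not in pending:
                pending.append(new_term)
    return issued, results


# Hypothetical stubs: a tiny "search engine" and a term-derivation table.
index = {"anthrax": ["u1", "u2"], "spores": ["u2", "u3"]}
related = {"u1": ["spores"]}
issued, results = mine(
    ["anthrax"],
    search=lambda term: index.get(term, []),
    derive_terms=lambda urls: [t for u in urls for t in related.get(u, [])],
    stop=threading.Event(),
)
```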

Referring to FIG. 4, a system diagram illustrates an exemplary logical implementation 170 of the deep web miner 112 and its interaction across a network according to various aspects of the present invention. The system diagram may be utilized, for example, to implement the method described with reference to FIG. 3. As noted above, the illustrated deep web miner 112 includes a user interface component 114, a mining component 116 and a crawling component 118.

The user interface component 114 provides a graphical user interface that allows the user to interact with the deep web miner 112, such as for entering seed information, monitoring and/or directing the mining/data retrieval process, for interacting with the results and/or for performing any other processes or functions implemented by the deep web miner 112. For example, the user may interact with an abstract form 172 to provide information that is utilized to initiate a deep mining operation. Also, the user may utilize additional software tools such as analytical applications, visualization applications, web browsing applications, etc., to dynamically interact with the results in addition to or alternatively to the user interface component 114 of the deep web miner 112. Exemplary screen shots of the user interface are described more fully herein.

The mining component 116 further comprises a mining parameters component 174 and a plug-in component 176. In practice, the mining parameters component 174 may be integrated into the plug-in component 176. The mining parameters component 174 organizes the search terms that may be utilized to fill in fields of on-line forms. The plug-in component 176 comprises one or more “form-understanding plug-ins” as described with reference to FIG. 3, where each plug-in is configured to understand one or more on-line forms. In this regard, a selected plug-in from the plug-in component 176 maps the appropriate query terms/search parameters from the mining parameters component 174 to the corresponding on-line form that the particular plug-in services, as described more fully herein.
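As a hedged illustration of this mapping, the Python sketch below models a form-understanding plug-in as an object that knows one form's submission URL and field names, and that translates abstract query parameters into a concrete GET request. The class name `FormPlugin`, the URL `http://search.example.com/find` and the field names `q`, `num` and `lang` are all hypothetical; real forms may require POST submissions, hidden fields or session state.

```python
from urllib.parse import urlencode


class FormPlugin:
    """Hypothetical plug-in that 'understands' a single on-line form."""

    def __init__(self, action, field_map, fixed=None):
        self.action = action        # URL the form submits to
        self.field_map = field_map  # abstract parameter -> form field name
        self.fixed = fixed or {}    # fields the form always sends

    def build_request(self, params):
        """Map abstract query terms/search parameters onto the form's fields."""
        fields = dict(self.fixed)
        for name, value in params.items():
            if name in self.field_map:  # parameters the form lacks are ignored
                fields[self.field_map[name]] = value
        return self.action + "?" + urlencode(fields)


plugin = FormPlugin(
    "http://search.example.com/find",
    field_map={"terms": "q", "max_results": "num"},
    fixed={"lang": "en"},
)
request_url = plugin.build_request({"terms": "ebola virus", "max_results": 20})
```

Keeping the abstract parameters separate from the field names means the same task parameters can be reused with a different plug-in for a different form.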

The illustrated crawling component 118 includes a content retrieval component 180 and an analysis component 182. The content retrieval component 180 obtains data from the Internet based upon the relevant result URLs identified by the plug-in component 176. The content gathered by the content retrieval component 180 is stored in the local storage 122 as will be described more fully herein. The analysis component 182 may analyze the gathered content, such as to generate, modify and revise the query terms/search parameters maintained by the mining parameters component 174.

In operation, the user utilizes the user interface component 114 to provide seed information, e.g., using an abstract form 172. Based upon the seed information, the mining component 116 selects the corresponding form-understanding plug-in and retrieves the query page of the selected on-line form 184. The form-understanding plug-in then simulates the entry and submission of a query in the actual form 184, e.g., based upon one or more of the parameters stored in the mining parameters component 174, by mapping the derived query terms/search parameters to the on-line form to make forms-based requests for information. The query entered into the query page of the actual on-line form is submitted to a form processing device 186, such as a search engine, and the results thereof are communicated back to the deep web miner 112.

The deep web miner 112 may obtain content for all result URLs returned by the form processing device 186 that are recognized by the selected form-understanding plug-in. Alternatively, the deep web miner 112 may constrain the number of result URLs for which content is gathered, e.g., based upon user-defined preferences that are established using the user interface component 114. For example, as noted above, the search engine may service thousands or millions of domains. As such, the user interface component 114 of the deep web miner 112 may allow the user to specify the maximum number of query results that are captured.

The result URLs are passed to the content retrieval component 180, which obtains their corresponding content and optionally extracts hyperlink URLs therein to gather additional content 188, a process commonly referred to as “crawling”. In this regard, the deep web miner 112 may explore not only the surface web 132 but also the deep web 134. The gathered content 188 may comprise, for example, web pages, documents and other files, including media files such as graphics, video and audio files, scripts and other executable programs, etc. Additionally, content 188 retrieved by the content retrieval component 180 may include metadata. For example, the content retrieval component 180 of the deep web miner 112 may capture the result page corresponding to each relevant query result URL, and nothing more. Alternatively, the user may wish to capture each result page and then explore the pages that are linked-to by each result page. As noted above, according to aspects of the present invention, the user interface component 114 may be utilized to allow the user to define a strategy for capturing content and thus control the manner in which link exploration is implemented. Thus, link exploration may be constrained, e.g., by the total number of links, by link depth, domain, relevance of page content, etc. Link exploration may also be constrained by limiting the number of result pages that are captured.

The content 188 obtained by the content retrieval component 180 may be analyzed by the analysis component 182, so as to modify the search terms provided by the mining parameters component 174, which are used to submit queries to the actual form 184. Moreover, the information returned from crawling operations performed by the content retrieval component 180 may be stored to local storage 122, e.g., by constructing a local navigable copy of the results as set out more fully herein. Moreover, the results may be dynamically conveyed to the user interface 114 so that the user can interact with the stored content while the deep web miner 112 iterates the search.

The content retrieval and analysis module 180 may also analyze the retrieved content. The results of this analysis are utilized by the query generator 182 to generate new content-based query terms 26, which are then used to update parameters maintained by the mining parameters component 174.

Deep Web Mining Tasks:

According to various aspects of the present invention, the deep web miner 112 maintains a collection of “Tasks”. Each task may embed abstract query parameters and collection parameters that support a single collection effort. Thus, a task may be initialized and ready for execution, executing, complete, initiated, paused, saved, etc. Correspondingly, a user may select previously saved tasks, which can then be re-initialized and/or re-executed/re-started. In such cases, any previously captured content may be discarded, archived, or otherwise saved. According to various aspects of the present invention, the deep web miner 112 may also be threaded so that multiple tasks can execute concurrently. The utilization of threads may improve performance, for example, when many tasks must access web sites at distant locations and/or when tasks experience slow communications throughput. As such, the deep web miner 112 may create at least one deep mining thread associated with a defined new task to perform a mining process as set out in greater detail herein.

Cookie Handling:

During normal web browsing, a visited server may return cookies for local storage on the processing device hosting the deep web miner 112. Proper cookie handling is necessary for many websites to function correctly and predictably. For example, a conventional system that utilizes multiple web browser instances to concurrently explore a URL may consolidate or overwrite the cookies associated with each browser instance. According to various aspects of the present invention, the deep web miner 112 manages multiple isolated cookie spaces. For example, three cookie spaces may be managed per task to prevent unintended consolidation or overwriting of cookie spaces. The cookie spaces may include a first cookie space for deep web mining forms processing, a second cookie space for link exploration (web crawling) and a third cookie space for isolated browsing, which is described more fully herein. Depending on how each task is configured, these three cookie spaces may or may not be independent and isolated. That is, cookie-space isolation may be user-configurable.
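One way such per-task cookie spaces could be realized, offered only as a sketch, is to give each space its own cookie jar and to build a separate HTTP opener around each jar. The Python fragment below uses the standard `http.cookiejar` and `urllib.request` modules; the class name `TaskCookieSpaces` and the space labels `forms`, `crawl` and `browse` are hypothetical.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


class TaskCookieSpaces:
    """Three cookie spaces for one task, isolated or shared by configuration."""

    def __init__(self, isolated=True):
        forms = CookieJar()
        self.jars = {
            "forms": forms,                                # forms processing
            "crawl": CookieJar() if isolated else forms,   # link exploration
            "browse": CookieJar() if isolated else forms,  # isolated browsing
        }

    def opener(self, space):
        """Return an opener whose cookies are read from and written to one space."""
        return build_opener(HTTPCookieProcessor(self.jars[space]))


isolated_task = TaskCookieSpaces(isolated=True)
shared_task = TaskCookieSpaces(isolated=False)
crawl_opener = isolated_task.opener("crawl")  # no request is sent here
```

Because each task constructs its own jars, cookies set while crawling one task cannot overwrite those of another task, or of another space within the same task when isolation is enabled.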

Output from the deep web miner 112 may be stored in multiple places. For example, as noted in greater detail herein, content including documents, media, executable code, etc., may be stored in a local file system, such as local storage 122, as a local navigable copy 124 of the content obtained by crawling locations across the network 104. Moreover, the documents, content and media may be mapped from task and URL by an embedded relational database. Also, metadata such as HTTP transactional metadata may be stored directly in a corresponding database. Once stored, the deep web miner 112 may provide capability to search, analyze, navigate, graph or otherwise manipulate or interact with the captured content, such as via an operator interacting with the user interface 114.

For example, for each executing or completed task, the deep web miner 112 may provide a tree component or other visual metaphor that shows each captured page and its role. The user may select a task and then click on tree nodes in order to browse captured pages with a conventional web browser in “isolation”. In isolation, the browser is blocked from requesting any page that has not already been captured by the task.

Referring to FIG. 5, according to various aspects of the present invention, the deep web miner 112 may iteratively continue nested processing cycles including forms submission and query at 192, URL results gathering at 194 and corresponding content gathering/crawling at 196 by crawling the URLs, with task-dependent user-definable termination criteria. A stopping event may be defined by running out of information to crawl, receiving a request for a new search or otherwise meeting one or more predetermined stopping criteria set by the user. For example, the user may specify a predetermined number of pages or links to follow. The user may also limit the size or types of information that is returned, etc., by setting user preferences in the user interface component 114. Moreover, other sequences may be utilized to perform deep web mining using the systems and techniques described more fully herein.

Exemplary User Interface Component:

Referring to FIG. 6, a screen shot 202 illustrates an exemplary implementation of aspects of the user interface component 114 of the deep web miner 112, wherein a user has started the deep web miner 112, e.g., for the first time, and has no defined tasks. Referring to FIG. 7, after opening the deep web miner 112, a user may open a dialog 204 to create/define a new task corresponding to a concept space associated with a topic of interest to a user. In the illustrated exemplary dialog 204, the user may provide a name for the task at 206.

Moreover, the user may specify seed information, e.g., in a query tab 208. For example, the user may identify an on-line form at 210 to begin the iterative searching process. The identified form is then matched with a corresponding form-understanding plug-in as described in greater detail herein. The user may also enter search terms at 212. For example, as shown, the user has entered the term “Ebola” as a query term. The user may also be able to specify constraints at 214 and at 216, e.g., to constrain various mining and/or crawling parameters.

As illustrated, the user has constrained the crawled URL domains to match the two highest-level domain segments of the result URLs obtained by the deep web miner 112. For instance, if the search engine returns the result URL “www.cdc.gov/ncidod/dvrd/spb/mnpages/dispages/ebola.htm”, the two highest-level domain segments are “cdc” and “gov”, so that crawling is subsequently constrained to explore links within the “cdc.gov” domain. The crawlers will thus not explore links in the “amazon.com” domain in this example. Although the deep web miner 112 obtains seed information with regard to the concept space including an on-line form and at least one search term in this example, other arrangements for obtaining seed information may alternatively be implemented as described more fully herein.
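The domain constraint in this example can be sketched in a few lines of Python. The helper names `domain_key` and `in_scope` are hypothetical, and the sketch simply compares the last two dot-separated labels of each host name; it makes no attempt to handle country-code suffixes such as “co.uk”, for which two segments would be too coarse.

```python
from urllib.parse import urlsplit


def domain_key(url, segments=2):
    """Return the `segments` highest-level domain labels of a URL's host."""
    if "//" not in url:
        url = "//" + url  # let urlsplit treat a bare "www.cdc.gov/..." as a host
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-segments:])


def in_scope(link, result_url, segments=2):
    """True if `link` shares the result URL's highest-level domain segments."""
    return domain_key(link, segments) == domain_key(result_url, segments)


result = "www.cdc.gov/ncidod/dvrd/spb/mnpages/dispages/ebola.htm"
key = domain_key(result)  # "cdc.gov"
follow = in_scope("http://emergency.cdc.gov/agent/", result)
skip = in_scope("http://www.amazon.com/books", result)
```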

Referring to FIG. 8, a screen shot illustrates an exemplary KeyGen tab 218 of the dialog 204 to set up user-defined keyword generation parameters 220. As shown, the user has altered the 2-gram frequency cutoff percentile (designated ‘%-tile’ in the figure). 2-gram frequency will be described in greater detail herein. Referring to FIG. 9, a screen shot illustrates an exemplary Capture tab 222, which is utilized to specify user parameters 224 regarding Crawler and Media-Capture threads. Herein, the user may set limits, such as on the maximum number of query results obtained per query issued, the maximum number of crawler visits per result, maximum size of files to collect, e.g., for HTML and/or non-HTML documents, media handling limits, thread processing limits, etc. Referring to FIG. 10, a screen shot illustrates an exemplary Cookies tab 226, which is utilized to specify user parameters 228 regarding Cookie privacy policies.

Referring to FIG. 11, a screen shot illustrates an exemplary display 230 wherein the new task 232, designated “Ebola”, is defined in the present example. Even though the task is selected, the lower results pane is empty because the task has not yet been executed. Referring to FIG. 12, a screen shot illustrates the exemplary display 230 after the “Ebola” task has been started using task controls 232. The results of the deep web mining process are displayed in results pane 234. As shown, the highest level of the results tree, illustrated with magnifying glass icons, contains the deep-miner queries. The next-to-highest level listings are the direct search results. The 3rd and deeper levels of the tree are crawled results. According to various aspects of the present invention, crawled results may override the user-defined depth constraint because they satisfy the query constraints. Such results may thus be distinguished in the results pane 234, such as by color, indicia, etc. In an illustrative example, crawled results that override user-defined depth constraints are displayed in green. According to various aspects of the present invention, selecting any URL in the displayed results pane 234 may open a web browser in an isolated local virtual web-space to view collected content corresponding to the “Ebola” task.

Referring to FIG. 13, a screen shot illustrates an exemplary screen display wherein the “Ebola” task 232 has been stopped and a new task 236, designated “Anthrax”, has been created for purposes of illustration. Referring to FIG. 14, a screen shot illustrates the exemplary display 230 after the “Anthrax” task 236 has been started. In this exemplary screen shot, the “Anthrax” task 236 is selected in the upper pane, and “Anthrax” results are shown in the results pane 234. If the user clicks to select the “Ebola” task 232 in the upper pane, then “Ebola” results are shown in the results pane 234. Clicking on any result in the lower pane displays the collected web pages within the isolated virtual web-space that is associated with that particular task.

The User Interface Component—Mining Component Exchange:

Referring to FIG. 15, a screen shot 240 is illustrated of an exemplary on-line form 184A, such as an accessed form 184 described more fully herein with reference to FIG. 4. The form 184A may be accessed, for example, by targeting the URL entered at 210 in the query tab of the task dialog 204. This illustrative type of form is typical of what a user would see when using a traditional search engine. Forms such as these are accessed and populated by the form-understanding plug-ins component of the deep web miner 112 to initiate searches, as described more fully herein.

Referring to FIG. 16, a form-understanding plug-in has been selected from the plug-in component 176 that “knows” the illustrated form. Keeping with the above example, assume that the user has selected a topic of interest such as “anthrax”. The user has thus provided seed information to the deep web miner 112 which includes this topic of interest “anthrax”. The derived query terms/search parameters are mapped to appropriate fields on the actual form 184A by interaction between the selected form-understanding plug-in and the form. As a result, the exemplary search form is populated with properly formatted search terms 242 in the appropriate field(s) and the form-understanding plug-in triggers the search to be conducted, such as by implementing an appropriate submission technique, e.g., activating the “search button” 244 provided on the form. The deep web miner 112 thus automatically submits the query to the search engine to execute the search.

Referring to FIG. 17, a screen shot illustrates a partial listing of the results 246 of the executed search from FIG. 16. In general, the deep web miner 112 retrieves the results page of the search and analyzes the page information for relevant content. For example, depending upon user preferences, relevant query result URLs from the search may be obtained for subsequent crawling. In this regard, the processing may require obtaining more than one page of search results. If more results from the executed search are available than can be displayed on the exemplary result page, then the deep web miner 112 can continue iteratively retrieving additional results via interaction between the selected form-understanding plug-in and the targeted site, e.g., by simulating the activation of the “NEXT” link 248 or similar links on the results page 246, or through any other appropriate interactions with the corresponding form processing engine 186.
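As an illustrative sketch of this paging behavior, the Python function below keeps following a “next” reference until either the requested number of result URLs has been gathered or no further results page exists. The function name `collect_result_urls`, the dictionary layout of a parsed results page and the toy `site` data are assumptions; in practice the parsing of each results page would be performed by the selected form-understanding plug-in.

```python
def collect_result_urls(first_page, fetch_page, max_results):
    """Gather result URLs across paged results, following 'next' references."""
    urls, page = [], first_page
    while page is not None and len(urls) < max_results:
        parsed = fetch_page(page)  # -> {"results": [...], "next": page or None}
        remaining = max_results - len(urls)
        urls.extend(parsed["results"][:remaining])
        page = parsed.get("next")  # stands in for activating the "NEXT" link
    return urls


# Hypothetical three-page result set, already parsed.
site = {
    "page1": {"results": ["u1", "u2"], "next": "page2"},
    "page2": {"results": ["u3", "u4"], "next": "page3"},
    "page3": {"results": ["u5"], "next": None},
}
first_three = collect_result_urls("page1", site.__getitem__, 3)
everything = collect_result_urls("page1", site.__getitem__, 100)
```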

Referring to FIGS. 18A-B generally, a block diagram 250 of an exemplary implementation of the deep web miner is illustrated according to various aspects of the present invention. In the illustrated implementation, a user begins by creating a deep web mining task, loading seed information and starting the task as described more fully herein. In response thereto, a user-interface thread creates a single new “deep-mining thread” to execute deep-mining activities on behalf of the task. The deep-mining thread creates a pool of crawler and content-service threads and holds initial query parameters that are used to generate one or more simple queries. The flow of processing is as follows:

The deep-mining thread generates one or more queries at 252, e.g., using a query generator. Each query is generated from keyword information that is specified entirely within the task, e.g., from user-provided seed information and/or generated keywords. The deep-mining thread also queues the queries at 254 for subsequent processing in the query queue.

The task declares a specific implementation of an abstract forms-based query service at 256. For example, the task may declare a specific implementation of an abstract forms-based query service in a corresponding content-service thread that executes a deep mining process by matching an identified on-line form to a corresponding form-understanding plug-in that understands the format of the on-line form. In this regard, the selected form-understanding plug-in simulates the submission of a query and identifies relevant result addresses.

The abstract forms-based query service provides a simple, uniform interface for all implementations. In this regard, implementations may be realized by form-understanding plug-ins which are discovered when the deep web miner 112 is initialized as described more fully herein. The declared implementation transforms the query into an appropriate network request, e.g., an HTTP request, and transacts with an HTTP transport component at 257 to retrieve one or more query result pages. Query result pages are ultimately transformed into a stream of individual result URLs.

After initialization and generation of queries, the deep-mining thread implements a steady state (SS) monitor at 258 that iterates until the task is complete. The SS monitor invokes the forms-based query service to retrieve the individual result URLs. As noted in greater detail herein, the maximum number of result URLs may be limited by a task parameter, e.g., a mining parameter, which can be provided by the user when the deep web miner task is created or can be limited by default parameters within the deep web miner 112 parameters listings. When the limit is reached, if utilized, the SS monitor may attempt to generate additional queries. If the limit is not reached, but the forms-based query service is unable to provide sufficient result URLs, then the SS monitor may request that additional queries be generated.

Next, query result addresses are queued in a crawler queue. For example, the SS monitor may invoke a crawler method to push individual result URLs onto the head of a crawler queue at 260. The crawler maintains a pool of threads at 262 that asynchronously service the URLs. According to various aspects of the present invention, while the crawler URL queue is empty, all crawler threads may sleep. When URLs are queued in the crawler queue, crawler threads are awakened to service them. If all crawler threads are busy, then additional URLs remain queued until crawler threads become available to handle them. If a crawler thread completes processing its URL and there are no more URLs in the queue, then the thread goes to sleep.
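The queue-and-pool arrangement described above resembles a conventional producer/consumer pattern, which may be sketched in Python as follows. The function name `run_crawler` and the `fetch` callable are hypothetical; the blocking `Queue.get` call plays the role of a crawler thread sleeping while the queue is empty, and a `None` sentinel is an assumption used here to shut the pool down cleanly.

```python
import queue
import threading


def run_crawler(urls, fetch, workers=4):
    """Service queued URLs with a fixed pool of crawler threads."""
    crawl_queue = queue.Queue()  # first in, first out
    results, lock = {}, threading.Lock()

    def crawler_thread():
        while True:
            url = crawl_queue.get()  # blocks (sleeps) while the queue is empty
            if url is None:          # sentinel: no more work for this thread
                crawl_queue.task_done()
                return
            try:
                outcome = fetch(url)
            except Exception as exc:  # keep the thread alive on a bad URL
                outcome = exc
            with lock:
                results[url] = outcome
            crawl_queue.task_done()

    pool = [threading.Thread(target=crawler_thread) for _ in range(workers)]
    for thread in pool:
        thread.start()
    for url in urls:
        crawl_queue.put(url)   # excess URLs wait until a thread is free
    crawl_queue.join()         # wait until every queued URL is serviced
    for _ in pool:
        crawl_queue.put(None)
    for thread in pool:
        thread.join()
    return results


pages = run_crawler(["u1", "u2", "u3"], fetch=lambda url: url.upper(), workers=2)
```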

Each result address may be asynchronously serviced by a corresponding crawler thread that obtains content and/or metadata that is cached in a local storage medium. For example, each crawler thread may pull and service a URL from the tail of the crawler queue. In this regard, the crawler attempts to retrieve the content associated with the URL at the content retrieval component at 264. Retrieved content may be stored directly in a file system, e.g., the local storage 122 as described with reference to FIG. 1. HTTP transactional metadata may also be stored at the HTTP transport layer, e.g., in a relational database management system (RDBMS) at 123, as described more fully herein.

The system processes the content of the returned results and may update a display with a listing of the mined results, e.g., by updating the user interface 114, wherein the user may browse a local navigable copy of the crawled results in isolation by selecting a navigable entry of the listing.

According to various aspects of the present invention, retrieved content undergoes a processing workflow. Initially, “Content” consists of a buffer of bytes. Content may then be processed by a sequence of one or more “processors”. For example, each processor may be associated with a different returned file type. When a processor is done processing its content, it may invoke one or more additional target processors. In this way, processors do a bit of work and then feed their results into other processors.

In the present illustrative example, there are three types of retrieved content, including raw content that consists of bytes or character strings, structured HTML object hierarchies, which are also referred to as document object models (DOMs), and structured text documents. In practice, other types of retrieved content may also/alternatively be defined.

As used herein, processors that consume raw content are referred to as “Content Processors” and may implement a standard interface, designated herein as IContentProcessor. Exemplary IContentProcessors include an HtmlContentProcessor 266, a CssContentProcessor 268 and a PdfContentProcessor 270.

Processors that consume DOMs are referred to herein as “DOM Processors” and implement a standard interface designated IDomProcessor. Exemplary IDomProcessors include an HtmlMediaCollectorDomProcessor 272, a DocumentBuilderDomProcessor 274 and a SequencerDomProcessor 276.

Processors that process structured text documents are referred to herein as “Document Processors” and implement a standard interface designated IDocumentProcessor. Exemplary IDocumentProcessors include a DebugDumpDocumentProcessor 278, an XmlDumpDocumentProcessor 280, a WordStatsDocumentProcessor 282 and an InvariantPhraseScrubberDocumentProcessor 284.

Some processors may be utilized to transform one type of content into another. For example, an HtmlContentProcessor at 266 may build a DOM that is passed to a target, e.g., an HtmlMediaCollectorDomProcessor 272. This system of processors may be utilized, for example, where each type of document, such as HTML, PDF, cascading style sheets (CSS), character strings, structured text documents, etc., requires different treatment to access and collect the information found therein. As such, a plurality of processors may be utilized in the deep web miner workflow. See, for example, the table set forth below for an exemplary collection of processors.
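The notion of processors that each do a bit of work and then feed one or more targets can be sketched generically as follows. The Python classes `Processor`, `Decode`, `Tokenize` and `Collect` are simplified, hypothetical stand-ins; they are not the IContentProcessor, IDomProcessor or IDocumentProcessor interfaces themselves, whose signatures are not reproduced here.

```python
class Processor:
    """Base class: transform an item, then hand the result to every target."""

    def __init__(self, *targets):
        self.targets = targets

    def transform(self, item):
        return item

    def process(self, item):
        output = self.transform(item)
        for target in self.targets:
            target.process(output)


class Decode(Processor):
    """Consumes raw bytes and produces a character string."""

    def transform(self, item):
        return item.decode("utf-8")


class Tokenize(Processor):
    """Consumes a character string and produces a list of words."""

    def transform(self, item):
        return item.split()


class Collect(Processor):
    """Terminal processor that records whatever reaches it."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def transform(self, item):
        self.seen.append(item)
        return item


sink = Collect()
workflow = Decode(Tokenize(sink))  # bytes -> text -> words -> sink
workflow.process(b"deep web miner")
```

Since a processor may be constructed with several targets, one output can fan out to multiple downstream processors.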

TABLE 1

| Name | Input | Output | Target | Extracts URLs | Collects Media | Description |
| --- | --- | --- | --- | --- | --- | --- |
| HtmlContentProcessor | Content (text/html) | DOM | SequencerDomProcessor or HtmlMediaCollectorDomProcessor | Yes | No | Parses HTML content. Builds DOM. Collects URLs for crawler. |
| PdfContentProcessor | Content (application/pdf) | Document | WordStatsDocumentProcessor | Yes | No | Parses PDF content. Builds Document. Collects URLs for crawler. |
| CssContentProcessor | Content (text/css) | | | Yes | Yes | Parses CSS content. Builds flat DOM. Collects URLs for crawler. Retrieves and stores CSS-referenced media. |
| SequencerDomProcessor | DOM | DOM | DocumentBuilderDomProcessor | No | No | Collects and queues DOMs from multiple threads. Processes queued DOMs sequentially, from a single thread. Prevents race conditions in thread-unsafe code. |
| HtmlMediaCollectorDomProcessor | DOM | File(s) | | No | Yes | Examines DOM for references to media. Retrieves and stores referenced media. |
| DocumentBuilderDomProcessor | DOM | Document | InvariantPhraseScrubberDocumentProcessor or WordstatsDocumentProcessor | No | No | Extracts HTML element content. Ignores style and script content. Inserts implicit line breaks. Constructs structured text Document objects. |
| InvariantPhraseScrubberDocumentProcessor | Document | Document | WordstatsDocumentProcessor | No | No | Buffers structured text Documents. Removes invariant phrases such as headers, navigation labels, etc. |
| WordstatsDocumentProcessor | Document | Document | XmlDumpDocumentProcessor or none | No | No | Collects 2-gram word statistics within phrases. |
| XmlDumpDocumentProcessor | Document | File | DebugDumpDomProcessor or none | No | No | Exports compiled text analytic metadata to XML files. |
| DebugDumpDomProcessor | DOM | File | | No | No | Dumps debug information to log file(s). |
| DebugDumpDocumentProcessor | Document | File | | No | No | Dumps debug information to log file(s). |

The crawler maintains registries for content processors, e.g., processors at 266, 268, and 270 that implement a standard interface designated IContentProcessor. The crawler also maintains registries for processors that consume DOMs, e.g., processors at 272, 274 and 276 that implement a standard interface designated IDomProcessor. The registry of content processors may be keyed by type, e.g., the Multipurpose Internet Mail Extensions (MIME) type in the illustrative example. After retrieving the content associated with a URL, the crawler examines the content MIME type and uses a MIME-based selector at 286 to dispatch the content to the correct content processor. The deep web miner 112 may thus support MIME types such as HTML, CSS and PDF, although additional and/or alternative types may be supported. If no processor is found for the MIME type of the content, then the content is not processed any further.
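A registry keyed by MIME type, together with a selector that dispatches on it, might look like the following Python sketch. The class name `MimeSelector` and the callables registered with it are hypothetical; the sketch assumes that the MIME type is taken from a Content-Type header value, whose optional parameters (such as a character set) are ignored for lookup purposes.

```python
class MimeSelector:
    """Registry of content processors keyed by MIME type."""

    def __init__(self):
        self._registry = {}

    def register(self, mime_type, processor):
        self._registry[mime_type.lower()] = processor

    def dispatch(self, content_type, content):
        """Route content to its processor; return None if no processor is found."""
        mime_type = content_type.split(";", 1)[0].strip().lower()
        processor = self._registry.get(mime_type)
        if processor is None:
            return None  # unsupported type: content is not processed further
        return processor(content)


selector = MimeSelector()
selector.register("text/html", lambda data: ("html", len(data)))
selector.register("text/css", lambda data: ("css", len(data)))

handled = selector.dispatch("text/html; charset=utf-8", b"<p>hi</p>")
ignored = selector.dispatch("image/png", b"\x89PNG")
```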

As noted above, the illustrative IContentProcessor interface is implemented by several different processors, of which three exemplary processors will be explained herein. The HtmlContentProcessor at 266 may use an open source HTML parser to build a hierarchical “document object model” (DOM) of the HTML, which allows for detailed structural analysis. The HtmlContentProcessor at 266 forwards the DOM to every IDomProcessor in the crawler's registry, including the HtmlMediaCollectorDomProcessor at 272 and the SequencerDomProcessor at 276. The CssContentProcessor at 268 may use a primitive parser to build a flat document model that exposes references to media, and to other included cascading style sheets. The CssContentProcessor at 268 collects media (e.g., images). The CssContentProcessor at 268 may also extract references to nested cascading style sheets and feed them back to the crawler queue for subsequent crawling. The PdfContentProcessor at 270 may extract text from PDF documents and scan the extracted text for substrings that are syntactically valid URLs. The URLs extracted from these processors may then be fed back to the crawler queue 260 so that crawling of additional linked content may continue. The text content extracted may be composed into structured documents subject to one or more Document Processors.
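The two URL-harvesting steps mentioned above, scanning already-extracted text for URL-like substrings and pulling references out of a style sheet, can be approximated with regular expressions. The Python sketch below is deliberately primitive; the patterns `URL_RE` and `CSS_URL_RE` and both function names are assumptions, and the sketch covers only `url(...)` references rather than the full CSS or URL grammars. Extracting text from a PDF file is outside its scope.

```python
import re

# Simplified pattern for http/https URLs embedded in running text.
URL_RE = re.compile(r"""https?://[^\s<>"')\]]+""")

# Simplified pattern for url(...) references in a style sheet.
CSS_URL_RE = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)


def urls_in_text(text):
    """Return URL-like substrings, trimming trailing sentence punctuation."""
    return [match.rstrip(".,;:") for match in URL_RE.findall(text)]


def css_references(css):
    """Return targets of url(...) references (media and nested style sheets)."""
    return CSS_URL_RE.findall(css)


text = "See http://www.cdc.gov/anthrax/. Also (https://example.org/a?b=1), end"
found = urls_in_text(text)

css = "body{background:url('img/bg.png')} @import url(base.css);"
refs = css_references(css)
```

References found this way are typically relative, so a crawler would resolve them against the URL of the containing document before queuing them.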

Before a DOM is built, the web page data is raw content. If the web page consists of HTML, then the HtmlContentProcessor at 266 builds a DOM and forwards it to the HtmlMediaCollectorDomProcessor at 272 for additional processing. The HtmlMediaCollectorDomProcessor at 272 examines all HTML elements and identifies those with references to non-crawlable external media such as images, video, script, audio, etc. As noted in greater detail herein, a pool of threads may be used to collect and store external media in local storage 122. Moreover, the mappings of URLs to media files may be stored in the embedded database 123, e.g., the RDBMS, by a cache manager 288.

To allow for the unpredictability of network communication latency and throughput, content processing is performed on multiple crawler threads simultaneously, e.g., using the SequencerDomProcessor 276. This may prevent, for example, work from stalling, such as where a single process is waiting for communication with slow websites. However, after data retrieval is complete, multiple threads no longer serve a useful purpose. To the contrary, multiple threads may decrease efficiency in CPU-bound processing. Thread-safe code is also more difficult to develop, debug, and maintain. These issues may be avoided by collapsing the multiple crawler threads into a single thread, e.g., after data retrieval is complete. For example, the SequencerDomProcessor workflow may collapse multi-threading into a single thread, rather than supporting multi-threaded processing. However, other configurations may alternatively be implemented.

The first stages of DOM processing identify URLs that are crawlable via the DocumentBuilderDomProcessor at 274, as well as reference pages and media that need to be collected. Collecting relevant pages and media is a task suited to the deep web miner 112, and to maximize data collection, the deep web miner 112 can generate new queries. This is done by analyzing the text collected while the deep web miner 112 is iterating. For example, the deep web miner 112 may be given seed information requesting information about the topic “anthrax”. The deep web miner 112 may collect numerous web pages concerning “anthrax”. Moreover, for further exploration, the deep web miner 112 may need to decide what concepts are related to “anthrax”. To accomplish this task, the deep web miner 112 may be required to analyze the text content of the collected pages. The use of the DOM allows the deep web miner 112 to separate text content from HTML markup. Each HTML element may have text content.

However, some HTML elements, such as SCRIPT and STYLE, have text content that is not domain content. The DocumentBuilderDomProcessor at 274 extracts text content and forms it into structured text document objects, which are hierarchical structures that expose the linguistic organization of the text.

In this regard, the content of the returned results may be processed by identifying the text content of returned results, performing a linguistic organization of the identified text, identifying new terms associated with the corresponding concept space and iteratively repeating the mining process until a predetermined stopping event is detected.

Referring to FIG. 19, a linguistic organizational breakdown is shown. A document may contain an ordered sequence of child contexts. A context may contain an ordered sequence of phrases and a phrase may contain an ordered sequence of tokens. From this structure, it is relatively easy to determine which tokens appear in the same documents, contexts, or phrases.
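As a minimal sketch of the document, context, phrase and token hierarchy, the following example builds such a structure from blocks of text and answers whether two tokens share a phrase. The class names, the choice to treat each text block as one context, and the punctuation-based phrase splitting are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phrase:
    tokens: List[str] = field(default_factory=list)


@dataclass
class Context:
    phrases: List[Phrase] = field(default_factory=list)


@dataclass
class Document:
    contexts: List[Context] = field(default_factory=list)


def build_document(text_blocks):
    """Assumed mapping: one text block (e.g., one element's text) per context."""
    doc = Document()
    for block in text_blocks:
        context = Context()
        # Assumed phrase boundary: runs of common punctuation marks.
        for chunk in re.split(r"[.,;:!?]+", block):
            tokens = re.findall(r"[a-z0-9'-]+", chunk.lower())
            if tokens:
                context.phrases.append(Phrase(tokens))
        if context.phrases:
            doc.contexts.append(context)
    return doc


def share_phrase(doc, a, b):
    """True if tokens `a` and `b` appear together in at least one phrase."""
    return any(
        a in phrase.tokens and b in phrase.tokens
        for context in doc.contexts
        for phrase in context.phrases
    )
```

Because each level is an ordered sequence, co-occurrence at the document or context level may be tested in the same manner by widening the scope of the membership test.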

Referring back to FIGS. 18A-B, the text content of a web page may contain uninformative boilerplate. Boilerplate may include headers, labels for navigational links, copyrights, legal warnings, and so forth. Boilerplate text is generally not related to the user-specified topics of interest and contaminates text-based statistics as will be described in greater detail herein.

An InvariantPhraseScrubberDocumentProcessor at 284 compares structured documents, identifies boilerplate texts, and removes them. The final text analytic stage of document processing, e.g., the WordStatsDocumentProcessor at 282, involves collecting frequency statistics on individual tokens as well as frequency and weighted proximity statistics on pairs (2-grams) of tokens that co-occur within phrases. Statistics may then be used to identify tokens that correlate with the user provided seed information. For instance, words such as breathing, transmission, vaccine, bacteria, and CDC are highly correlated with “anthrax”. The WordStatsDocumentProcessor at 282 collects statistics expressed as tokens in document structures. As an illustrative example, the analysis according to various aspects of the present invention may attempt to locate words that are near a given key word where such additional words are not close to other words. In this regard, a ranking of pairs of words may be created. Thus, terms such as Ebola+fever may be heavily exploited and rank near the top of the list. As such, this pairing may be deemed as not worth searching as the pair is too highly correlated. Rather, the system may jump somewhere spaced from the top of the keyword pair list, e.g., towards the middle of the pair listing. As an example, the system may select the 60%-80% span of the ranking when considering secondary search terms.
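By way of illustration, the collection of token frequencies and proximity-weighted pair statistics, followed by selection of a band of the ranking, may be sketched as follows. The inverse-distance weighting, the function names, and the alphabetical tie-breaking are assumptions of this sketch; the description above does not specify the weighting function.

```python
from collections import Counter
from itertools import combinations


def pair_statistics(phrases):
    """phrases: list of token lists. Returns (token counts, weighted pair scores)."""
    token_counts = Counter()
    pair_scores = Counter()
    for tokens in phrases:
        token_counts.update(tokens)
        for (i, a), (j, b) in combinations(enumerate(tokens), 2):
            if a != b:
                # Assumed weighting: adjacent tokens count 1, farther ones less.
                pair_scores[tuple(sorted((a, b)))] += 1.0 / (j - i)
    return token_counts, pair_scores


def mid_band_partners(pair_scores, seed, low=0.6, high=0.8):
    """Rank the seed's partners strongest first; return the [low, high) band."""
    ranked = sorted(
        (
            (score, (set(pair) - {seed}).pop())
            for pair, score in pair_scores.items()
            if seed in pair
        ),
        key=lambda item: (-item[0], item[1]),
    )
    words = [word for _, word in ranked]
    return words[int(len(words) * low):int(len(words) * high)]
```

Skipping the top of the ranking reflects the observation above that the most highly correlated pairings may not be worth searching.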

During processing, a large amount of meta-data may be generated. Some of the meta-data may be exported to XML documents for unspecified external processing. This process is performed by the XmlDumpDataDocumentProcessor at 280. The final step in the workflow handled by the illustrated crawler is processed by the DebugDumpDocumentProcessor at 278, which outputs debugging information to log file(s).

The SS monitor at 258 observes the number of query result URLs that are processed by a 2-Gram Word Frequency Model component at 290, and the number of crawled URLs that are processed. When crawling is complete and either no more query results exist, or the user-specified limit on the number of query results has been met, the SS monitor at 258 may request generation of additional queries using the results of the WordStatsDocumentProcessor at 282. When additional queries are generated, frequency and 2-gram statistics are drawn from the 2-gram word frequency model at 290. This model is built by the WordStatsDocumentProcessor at 282 and is forwarded to the 2-Gram Word Frequency Model component at 290.

According to aspects of the present invention, paired queries are generated by the Pair Query Generator at 292. The user may determine whether or not the deep web miner 112 should generate additional queries. If additional queries are to be generated, the user may determine which queries to generate, such as paired queries, chained queries, or both. The user may also control mining parameters using the user interface to control the generation of additional queries and/or to otherwise steer the deep web mining process. If queries are generated, the deep web miner 112 may execute the task until it is stopped, such as by the user.

The user interface may provide a work area for the user to browse results that have been captured. As illustrated, the work area includes a UI Model at 294, a UI Controller at 295 and a UI View at 296. The user invokes the interface, for example, by selecting a task, and then navigating a tree widget to a result URL as described more fully herein. Under this configuration, clicking on a result URL may launch an instance of a web browser and display the result.

The web browser at 297, such as Internet Explorer by Microsoft Corporation of Redmond, Wash., may be operated in a modified windows environment. For example, when the environment is prepared, a hook may be set in the registry that redirects web browser transactions through an HTTP proxy server at 298. In this regard, transactions within previously opened web browser windows are not affected by the redirection. On shutdown, the original windows environment may be restored.

While the windows environment is modified, all web browser transactions may be directed to an HTTP proxy server at 298 that is part of the deep web miner 112. The proxy server at 298 may examine each requested URL and determine whether or not the URL is in the deep web miner's page cache 288.

If the URL is found in the cache, e.g., by the cache manager at 288, the content is located on the file system 122, the original HTTP transactional meta-data is restored and the content is delivered in response to the HTTP request. If the URL is not found, then an appropriate status code, such as an HTTP status code of “403-Forbidden”, may be returned. The modified environment thus prevents unintentional access to the original network data source in the event that the workstation network is enabled. While browsing through the DWM proxy server, all HTTP requests may be matched both by URL and by the currently selected task.
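A minimal sketch of the cache-only request handling described above follows. The PageCache class, the in-memory dictionary standing in for the file system 122 and the embedded database 123, and the assumption that cached responses are replayed with status 200 are all hypothetical simplifications.

```python
from typing import Dict, Optional, Tuple

Entry = Tuple[Dict[str, str], bytes]


class PageCache:
    """Hypothetical in-memory stand-in: maps (task, URL) to headers and body."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], Entry] = {}

    def store(self, task: str, url: str, headers: Dict[str, str], body: bytes) -> None:
        self._entries[(task, url)] = (dict(headers), body)

    def lookup(self, task: str, url: str) -> Optional[Entry]:
        return self._entries.get((task, url))


def handle_request(cache: PageCache, current_task: str, url: str):
    """Return (status, headers, body) without ever contacting the network."""
    entry = cache.lookup(current_task, url)
    if entry is None:
        # Not captured for this task: refuse rather than fetch the live page.
        return 403, {"Content-Type": "text/plain"}, b"Forbidden"
    headers, body = entry
    # Assumption: replay with status 200 and the originally captured headers.
    return 200, headers, body
```

Keying the lookup on both the task and the URL reflects the matching of requests by URL and by the currently selected task.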

According to various aspects of the present invention, workflow is defined by nested query, results, and crawling cycles with task-dependent user-definable termination criteria. Various aspects of the deep web miner may provide collection of abstract query terms with execution-time mapping to web-page implemented terms by plug-in forms-based query services. Moreover, the deep web miner may be configured to detect and crawl URLs embedded within PDF, CSS and other forms of documents and files, perform text analytics against PDF content, etc.

Various aspects of the present invention provide the ability to constrain results pages to the internet domain, or any super-domain of the search engine. Various aspects of the present invention further provide the ability to constrain crawled pages to the domain, or any super-domain of the search engine or the domain or any n-segments of the domain of any results page. Moreover, a constrained crawling depth may be relaxed by satisfaction of abstract query parameters. Various aspects of the present invention further provide the ability to specify a number of threads for crawling and media collection during task execution.
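As one possible reading of the domain constraint, the following sketch keeps the last n segments of a reference host name and accepts a URL only if its host equals, or is a sub-domain of, that super-domain. The function names and the interpretation of “n-segments” as trailing host-name labels are assumptions.

```python
from urllib.parse import urlparse


def super_domain(host, segments):
    """Keep the last `segments` labels of a host name, e.g. 2 -> 'example.com'."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-segments:])


def within_constraint(url, anchor_url, segments=2):
    """True if `url` lies within the n-segment super-domain of `anchor_url`."""
    host = (urlparse(url).hostname or "").lower()
    anchor = super_domain(urlparse(anchor_url).hostname or "", segments)
    if not anchor:
        return False
    # Require a label boundary so 'badexample.com' does not match 'example.com'.
    return host == anchor or host.endswith("." + anchor)
```

Here, anchor_url may be the search engine's URL or the URL of a results page, depending on which constraint is being applied.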

Still further, various aspects of the deep web miner provide the ability to consolidate throttling across tasks that access the same query service implementation. This may be utilized, for example, to simulate the frequency and speed at which humans may access a corresponding query service, where such may be required to ensure successful query implementation thereof.

Still further, as noted above, various aspects of the deep web miner provide lexical processing of HTML text content including construction of text Document structure, the detection and removal of Invariant phrases, 2-gram word frequency within phrase, weighted by proximity, and techniques to find words with strongest correlations to words within disjunctive and conjunctive sets of words.

Referring now to FIG. 20A, an example illustrates a technique to generate “paired queries” with parameterized relevance ranking limits as noted previously. A single paired query takes an existing query and narrows it by pairing it with a single additional conjunctive term that has been determined to be weakly correlated with all existing primary terms. Keeping with the above example, assume that a search is conducted using a search engine that returns a significantly large number of pages, e.g., related to “anthrax”. If the deep web miner is configured to return less than the entirety of search results, e.g., a small percentage of the search results, then the returned pages may be chosen based on some statistical measure, e.g., the number of times that “anthrax” appears in the content of each page. Consequently, the mined pages may cover a very narrow range of concepts related to “anthrax”.

As yet another example, assume that the deep web miner captures a plurality of pages, e.g., 100 pages, and it is determined that the terms “breathing” and “transmission” are related to “anthrax”. Thus, the deep web miner's Pair Query Generator 292 issues queries “anthrax AND breathing”; “anthrax AND transmission”, etc. The effect, after many such pairings, is to broaden the range of explored concepts.

Referring now to FIG. 20B, according to various aspects of the present invention, the user may be able to control the breadth of the mined concept space by controlling how closely paired concepts, e.g., “breathing” or “transmission”, must relate to the primary concept “anthrax”, as well as controlling the number of paired queries.
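The paired query generation discussed with reference to FIGS. 20A-B may be sketched, under stated assumptions, as follows. The function name, the representation of relevance ranking limits as fractional bounds (low, high) over a ranked candidate list, and the literal “AND” syntax are illustrative choices; an actual query service may express conjunction differently.

```python
def paired_queries(primary_terms, ranked_terms, low=0.0, high=1.0, limit=None):
    """Build 'primary AND candidate' queries from a band of ranked candidates.

    ranked_terms is assumed to be ordered from the strongest to the weakest
    correlation with the primary terms; low/high bound the portion used.
    """
    candidates = [term for term in ranked_terms if term not in primary_terms]
    band = candidates[int(len(candidates) * low):int(len(candidates) * high)]
    if limit is not None:
        band = band[:limit]  # user-controlled number of paired queries
    base = " AND ".join(primary_terms)
    return [f"{base} AND {term}" for term in band]
```

For example, calling paired_queries(["anthrax"], ranked, low=0.4, high=0.8) with five ranked candidates uses only the third and fourth candidates, which narrows or widens the mined concept space as the bounds are adjusted.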

Referring now to FIG. 20C, according to various aspects of the present invention, a technique is provided to generate “chained queries” with parameterized relevance ranking limits, e.g., as may be implemented by a Chain Query Generator 299 illustrated in FIG. 18A. A chained query replaces primary query terms with alternative terms that have been determined to be strongly correlated with all primary terms. The effect is to broaden the range of explored concepts. Chained queries are further away from the primary concepts than paired queries. For example, chained queries may be useful for exhaustively mining websites. Moreover, chained queries can be combined with paired queries.
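As a hedged sketch of chained query generation, the following example keeps candidate terms whose weakest correlation with any primary term still meets a threshold, so that each surviving term is strongly correlated with all primary terms and may replace them in a new query. The score dictionary, the threshold min_score and the limit are hypothetical parameters standing in for the parameterized relevance ranking limits.

```python
def chained_queries(primary_terms, correlations, min_score=0.5, limit=3):
    """Return replacement terms strongly correlated with every primary term.

    correlations is assumed to map candidate -> {primary_term: score}, with
    missing scores treated as 0.0.
    """
    if not primary_terms:
        return []
    strong = []
    for candidate, scores in correlations.items():
        if candidate in primary_terms:
            continue
        # A candidate is only as strong as its weakest link to a primary term.
        weakest = min(scores.get(term, 0.0) for term in primary_terms)
        if weakest >= min_score:
            strong.append((weakest, candidate))
    strong.sort(key=lambda item: (-item[0], item[1]))
    return [candidate for _, candidate in strong[:limit]]
```

Each returned term would then seed a query of its own, which is why chained queries move further from the primary concepts than paired queries.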

Various aspects of the present invention provide the ability to limit lengths, such as minimum and maximum number of generated query keywords. Additionally, the deep web miner provides the ability to limit text analytics, such as to English language nouns, verbs, adjectives, and adverbs with recognition of hyphenation, common abbreviations, contractions, ordinals, possessive contractions, etc. The deep web miner may also enable verb stemming, which allows similar verbs to be treated equally during query generation. Also, the deep web miner may provide the ability for a user to interactively select and prioritize lists of concepts used to generate paired and chained queries.

The deep web miner may provide task support, such as for multiple tasks where each task corresponds to a specific deep-mining goal. In this regard, each task may be parameterized, executed, stopped, paused, reset, or deleted independently and concurrently. Task parameters may also be independently persisted and completed task results may be independently persisted. Further, tasks may be re-parameterized during execution and new parameters are adopted at the earliest possible time. Task re-parameterization may be transactional and multiple parameters may be set but are applied or rejected together.
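The transactional character of task re-parameterization may be illustrated by the following sketch, in which a batch of changes is validated in full before any value is modified. The TaskParameters class, the parameter names and the validator callables are hypothetical and serve only to show the apply-all-or-reject-all behavior.

```python
class TaskParameters:
    """Holds a task's parameters; updates are applied or rejected together."""

    def __init__(self, **initial):
        self._values = dict(initial)

    def get(self, name):
        return self._values[name]

    def update(self, changes, validators):
        """Validate every change first; modify nothing if any change fails."""
        for name, value in changes.items():
            check = validators.get(name)
            unknown = name not in self._values
            invalid = check is not None and not check(value)
            if unknown or invalid:
                return False  # reject the whole batch
        self._values.update(changes)
        return True
```

Under this approach, a running task never observes a partially applied set of parameters.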

According to various aspects of the present invention, the user interface may provide a single view of all tasks, task execution status, and collected results of a selected task. The user interface may also provide a tree view of the selected task's results that illustrates queries, each result page, each crawled page, and each deeply-crawled page. Moreover, the user interface may allow the user to set combinations of deep-mined results, by URL, including union, intersection, and difference. Still further, the user interface may allow the user to specify a unique “current” task. If the task is executing or has completed execution, then selecting the task loads the user interface with the task's results. The user interface may also display the progress of each task, including the number of pages and media objects collected, as well as a dynamically updated meter that represents data capture bandwidth for the task.

Task termination may be synchronized with deep-mining, crawling, and media-capture threads in order to avoid incomplete or broken pages. For instance, an HTML page may contain a FRAMESET that refers to multiple FRAMEs, where each FRAME refers to an HTML document, each HTML document may refer to multiple media objects and cascading style sheets (CSSs), and each CSS may refer to multiple media objects and/or other CSSs. The “reference tree” for the original FRAMESET document may include dozens or hundreds of URLs. If the original FRAMESET document has been captured, and the task is subsequently terminated, either explicitly, or by termination of the entire DWM application, then the DWM will continue to capture referents until the entire reference tree is completed, or until a timeout is reached.

According to further aspects of the present invention, task cloning may be implemented. For example, the deep web miner may be utilized to create a parameterized but not-yet-executed copy of a task.

According to further aspects of the present invention, the deep web miner may provide anonymity and/or security. As an example, the deep web miner may implement anonymous deep-mining, crawling and/or DNS using Tor (“The Onion Router”), e.g., as seen by the TOR processor 259 in FIG. 18A. Further, the deep web miner may implement user-configurable query submission and results URL collection throttling. For example, the deep web miner may provide the ability to throttle query submission rate in order to mimic human operation. The deep web miner may also provide the ability to throttle results retrieval rate in order to mimic human operation. Moreover, the deep web miner may throttle coordination among multiple deep web miner instances running on a common LAN. The use of throttling may allow the deep web miner to collect information without appearing as an automated software agent to the form processing engine 186, e.g., by operating at a lower speed to “throttle” the aggressiveness of the search and retrieval to act as if being manually steered by an operator. In a related aspect, the use of threads as described more fully herein allows multiple hits to corresponding pages at the same time. Thus, for example, each thread may hit a site only once every 30 seconds (or some other defined time interval). However, multiple sites may be visited concurrently when multiple threads are used to deploy crawling efforts.
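By way of illustration, a throttle that spaces successive hits to the same site while leaving other sites unaffected may be sketched as follows. The SiteThrottle class is hypothetical, and the sketch assumes the interval is enforced per site across all threads sharing the throttle, which is one possible reading of the consolidated throttling described herein. The clock is injectable so the behavior can be checked without waiting.

```python
import threading
import time


class SiteThrottle:
    """Spaces requests to each site by at least `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock
        self._next_allowed = {}
        self._lock = threading.Lock()

    def reserve(self, site):
        """Reserve the next slot for `site`; return seconds the caller should wait."""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(site, now))
            self._next_allowed[site] = start + self._min_interval
            return start - now
```

A crawler thread might call time.sleep(throttle.reserve(site)) before each request, so that different sites are still visited concurrently while any one site sees widely spaced hits.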

Still further, the user interface may provide isolated browsing that uses a proxy server to mimic the HTTP transactions that occurred during data collection while constraining browsing to previously-collected results related to the currently-selected task. Thus, an isolated virtual web space is created. Moreover, such isolated virtual web spaces may be created for each task. Isolated browsing may also prevent uncontrolled scripts and executable objects from executing, e.g., to contact remote web servers. Also as noted in greater detail herein, the deep web miner may be capable of isolated cookie spaces. This allows, for example, independent cookie handling policies, such as None, All and First-Party.

A form-understanding plug-in may not remain effective in perpetuity, considering that a form processing engine 186 may institute changes that render aspects of a form-understanding plug-in obsolete. For example, a given web site form may change the way that results are displayed, the logic used to implement search terms may be changed, the form may be relocated or removed, etc. As an illustrative example, a form may change from returning results in plain HTML to utilizing a JavaScript-based approach. However, according to aspects of the present invention, tools are provided that allow a user to create new form-understanding plug-ins and/or to edit, revise, modify or otherwise adapt the form-understanding plug-in to accommodate certain changes. Moreover, such tools allow a user to adapt the deep web miner 112 to accommodate new and/or changing forms without requiring the user to understand how to write computer program code.

As noted in greater detail herein, the deep web miner 112 may leverage an extensible form-understanding plug-in architecture to enhance automated processing of on-line forms, e.g., by allowing form-understanding plug-ins 176 to be customizable to accommodate predetermined and/or arbitrary form characteristics. Moreover, the extensible form-understanding plug-in architecture may provide tools that allow users and/or developers to expand or add to the capabilities of the form-understanding plug-ins 176, such as by providing the capability to add new plug-ins, modify existing plug-ins, delete obsolete plug-ins, etc.

According to aspects of the present invention, a plurality of approaches may be utilized to create form-understanding plug-ins 176. For example, a user may “teach” a form-understanding plug-in how to interact with a form by demonstrating form interaction, e.g., by pointing to a site containing a form, pointing to anchors and other distinguishing features and having the system “learn” patterns necessary to be able to interact with the form. Still further, an intelligent agent may be able to learn how to use a form without human intervention, or with minimal human assistance.

Referring to FIG. 21, a flow diagram 300 illustrates an exemplary approach to providing a tool for creating, editing, modifying or otherwise manipulating form-understanding plug-ins 176. In an illustrative implementation, the user interface 114 may include an “Add Site Plug-In” component that provides an interactive dialog and underlying capabilities to permit users to create form-understanding plug-ins 176. The method may present a user with a wizard-like series of windows that the user may interact with for the purpose of “training” the deep web miner 112, thereby providing the information necessary for a form-understanding plug-in to properly engage a specific web site's form processing engine 186, e.g., to map abstract query terms to the correct form inputs, recognize result anchors, and navigate to subsequent result pages.

The Add Site Plug-In component is schematically divided into steps that allow user interaction 302 and corresponding system operations 304 to process the user interaction 302. The Add Site Plug-In component may prompt the user to specify a form of interest at 306. In this regard, the user may identify the form by identifying a site URL, a search form within that web site, or any other information necessary to identify the form to the system by inputting appropriate information into a dialog box in a wizard screen. The form is obtained, e.g., retrieved and rendered at 308 and the user may optionally be able to confirm that the correct form is retrieved. In this regard, the retrieved form represents a query page for accessing the corresponding site's search engine.

Relevant form input(s) and example search term(s) may then be recognized or obtained. For example, the user may be prompted to identify characteristics of the form to the Add Site Plug-In component. In this regard, the user may initially identify form inputs at 310. The Add Site Plug-In component then learns the location of the form inputs from user action, e.g., by requiring the user to point and click on the query term dialog box of the corresponding form. The user is also prompted to enter example query term(s) at 312. Keeping with the above example of a wizard, a dialog box within the wizard may prompt the user to enter a simple query term. Alternatively, the Add Site Plug-In could otherwise obtain the form inputs and/or exemplary search terms without the assistance of a user, e.g., using a library of recognizers or other automated processes.

In response to obtaining this seed information, the Add Site Plug-In component simulates entry of the form to submit a query to the search engine based on the example query term(s). For example, the Add Site Plug-In component may access the Internet, e.g., using an appropriate HTTP transport 314 (which may be the same as transport 257 described with reference to FIG. 18A or a different instance of a transport), navigate to the web site/form of interest and submit the seed information obtained from the user based upon the learned location of the form inputs at 316.

The Add Site Plug-In component then retrieves and renders the results page at 318. For example, the Add Site Plug-In component may receive the query results returned in response to submitting the query form to the search engine. In this regard, the query results may include at least one page of addresses to locations on the network having content responsive to the submitted query.

As an illustrative example, the Add Site Plug-In component may enter a user-provided query term to the form, retrieve one page of search results and present the page of search results to the user. In this regard, the result page may not be “live”. Rather, the wizard may wrap the result page in its own processing screen to facilitate the learning necessary to navigate a “live” results page.

The Add Site Plug-In component then recognizes or obtains result anchors of interest within the query results and derives a pattern that distinguishes result anchors from non-result anchors. The Add Site Plug-In component may also recognize or obtain next page anchors of interest within the query results from the user and derive a pattern that distinguishes next page anchors from other anchors. For example, the component may allow the user to identify relevant result links at 320. By way of illustration, the user may identify all relevant search result anchors present on the returned page of results, such as by clicking on each anchor using a mouse. Because the result page is wrapped, the component can provide feedback to the user to confirm that the appropriate information has been identified.

Referring briefly to FIG. 22, a screen shot 350 illustrates an exemplary implementation of the “obtain relevant results link” aspect of the Add Site Plug-In component, wherein a user may identify all relevant result links by clicking on each link that corresponds to a valid search result. To aid the user in completing the task, such user-identified result links may be visually distinguished 352 from irrelevant links, by color, indicia, etc. By way of illustration, and not by way of limitation, the background of relevant result anchors previously identified by a user may be highlighted in a color such as pink.

Referring back to FIG. 21, given a page of results and a list of the result anchors of interest, the Add Site Plug-In component learns result links at 322. For example, the Add Site Plug-In component may, according to various aspects of the present invention described more fully herein, derive a pattern that the deep web miner 112 can use in future interactions to recognize all search results that the search engine 186 produces for arbitrary query term(s), such that it can distinguish search result anchors contained in the result page from other irrelevant anchors that do not correspond with individual search results, e.g., links corresponding to advertisements, site-specific links, and so on. Additionally, the user may interactively provide an example of how to navigate to the next page of results at 324 where more than one page of results is available given the user provided seed information. The Add Site Plug-In component learns to recognize next page links at 326. For example, the Add Site Plug-In component may, according to various aspects of the present invention, derive a pattern that it can use to recognize anchors used to navigate to subsequent result pages, and to distinguish next page anchors contained in the result page from other irrelevant anchors that do not permit navigation to the next page of results.

The resulting information (web site, form elements, result anchor recognizer pattern, and next result anchor recognizer pattern) may be reviewed by the user at 328. If the user approves the resulting information, the resulting form-understanding plug-in is persisted for subsequent use by the deep web miner. For example, the form-understanding information may be saved at 330 as a form-understanding plug-in. As a further example, the Add Site Plug-In component may write a file in the local storage 122 that encapsulates a specific form-understanding plug-in implementation 176. Subsequent deep web mining tasks may then utilize the new form-understanding plug-in as described more fully herein.

Not all forms will utilize simple query terms. As such, according to various aspects of the present invention, the Add Site Plug-In component may use an iterative process to obtain alternate flows from the user. For example, if a form utilizes one or more complex modes, such as phrase, exclusionary terms, etc., the Add Site Plug-In component may prompt the user to enter each mode so that the appropriate information can be learned.

According to aspects of the invention, an Add Site Plug-In implementation may utilize a plurality of methods to attempt to “learn” (i.e., derive an effective pattern for) a result link recognizer and a next page link recognizer.

For example, to derive a pattern to distinguish anchors of interest from others, the Add Site Plug-In component may recognize or obtain anchors of interest, e.g., from the user, and define a space of web page features to explore. The Add Site Plug-In component may further generate a series of one or more pattern instances within the web page feature space based on the anchors of interest and iteratively search through the series of pattern instances, e.g., from more general patterns to more specific patterns, to determine if the pattern matches one or more anchors present. The Add Site Plug-In component may accept a pattern if it matches only the anchors of interest and does not match any other anchors.

An exemplary implementation may apply a heuristic approach of deriving a pattern given examples of valid result anchors. For example, a heuristic approach may involve searching through a space of HTML features present within and/or nearby the result link anchors that may possibly distinguish result anchors from non-result anchors. Categories of such HTML features may be explicitly enumerated in advance within the Add Site Plug-In component, from which specific patterns to test may be derived based on the result anchors present in the example query results.

The search through patterns may proceed iteratively, testing more general HTML features first, i.e., those having the broadest applicability, followed by more specific HTML features, i.e., those expected to be more sensitive to changes a web site may one day make in the form of its result pages. The search through patterns terminates when an effective pattern within the result page HTML is found that can correctly distinguish the result anchors from the non-result anchors, unless no such pattern can be found, which may result in a failure to construct a form-understanding plug-in.
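The general-to-specific search may be sketched as follows, with each anchor reduced to a dictionary of features. The feature names (class, parent_tag, href_prefix), their ordering in FEATURE_ORDER, and the function name derive_pattern are assumptions for illustration; the description above does not enumerate the actual feature categories. A pattern is accepted only if it matches exactly the anchors of interest.

```python
# Hypothetical feature categories, ordered from most general to most specific.
FEATURE_ORDER = [
    ("class",),
    ("parent_tag",),
    ("parent_tag", "class"),
    ("parent_tag", "class", "href_prefix"),
]


def derive_pattern(anchors, selected):
    """anchors: list of feature dicts; selected: indexes of anchors of interest.

    Returns the first (most general) pattern that matches every selected
    anchor and no other anchor, or None if no such pattern exists.
    """
    chosen = [anchors[i] for i in selected]
    for features in FEATURE_ORDER:
        values = {tuple(anchor.get(f) for f in features) for anchor in chosen}
        if len(values) != 1:
            continue  # the examples disagree, so no single pattern fits them
        pattern = dict(zip(features, values.pop()))
        matched = {
            i
            for i, anchor in enumerate(anchors)
            if all(anchor.get(k) == v for k, v in pattern.items())
        }
        if matched == set(selected):
            return pattern
    return None
```

A return value of None corresponds to the failure case noted above, in which no effective pattern can be found and a form-understanding plug-in cannot be constructed from the given examples.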

As yet another illustrative example, an additional method for creating form-understanding plug-ins 176 may include providing a component to enable a user to create form-understanding plug-ins, such as by writing custom software or otherwise building the form-understanding plug-ins utilizing a library of routines for specifying the information needed to support deep web mining operations with a specific site's form processing engine 186. In this regard, the library of routines may enable a user to build a customized form-understanding plug-in by enabling the user to identify a web site, relevant form inputs and submission requirement(s) and patterns to distinguish result anchors and next page anchors from other anchors. For example, the information specified may include parameters such as site URLs; relevant form inputs and means of submission; and patterns that may distinguish result anchors from non-result anchors, and that may distinguish next page anchors from other anchors.

According to still further aspects of the present invention, some or all of the above-described user interaction in building a form-understanding plug-in may be replaced or otherwise implemented by an automated process. For example, the Add Site Plug-In component may obtain or identify a web site of interest, recognize or otherwise obtain relevant form input(s), generate or otherwise obtain example search term(s), recognize or otherwise obtain result anchors of interest within the query results, and/or recognize or otherwise obtain next page anchors of interest within the query results, etc., in an automated process.

Still further, the user input may be relegated to an approval mechanism. For example, the Add Site Plug-In component may obtain or identify a web site of interest, but prompt the user to confirm the action. Similarly, the Add Site Plug-In component may recognize or otherwise obtain relevant form input(s), generate or otherwise obtain example search term(s), recognize or otherwise obtain result anchors of interest within the query results, and/or recognize or otherwise obtain next page anchors of interest within the query results, etc., in an automated process, then subsequently prompt the user to confirm each action before saving the results and/or moving on to the next process.

As an example, rather than absolutely requiring the user to provide interaction, such as by providing an exemplary search term, the Add Site Plug-In component may use some general term or otherwise selected term that most search engines would respond to, or iteratively try a somewhat meaningful set of terms. As yet another example, the Add Site Plug-In component may automatically evaluate a library of effective next-page recognizers to find the next page anchors, etc.

Referring to FIG. 23, a block diagram of a data processing system is depicted in accordance with the present invention. Data processing system 400, such as one of the processing devices 102 described with reference to FIG. 1, may comprise one or more processors 402 connected to system bus 404. Also connected to system bus 404 is memory controller/cache 406, which provides an interface to local memory 408. An I/O bus bridge 410 is connected to the system bus 404 and provides an interface to an I/O bus 412. The I/O bus may be utilized to support one or more busses and corresponding devices 414, such as bus bridges, input output devices (I/O devices), storage, network adapters, etc. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.

Also connected to the I/O bus may be devices such as a graphics adapter 416, storage 418 and a computer usable storage medium 420 having computer usable program code embodied therewith. The computer usable program code may execute any aspect of the present invention, for example, to implement any aspect of any of the methods and/or system components illustrated in FIGS. 1-22. Moreover, the computer usable program code may be utilized to implement any other processes that are used to perform deep web searching, mining, etc., as set out further herein.

The various aspects of the present invention may be embodied as systems, computer-implemented methods and computer program products. Also, various aspects of the present invention may take the form of an embodiment combining software and hardware, wherein the embodiment or aspects thereof may be generally referred to as a “component” or “system.” Furthermore, the various aspects of the present invention may take the form of a computer program product on a computer usable storage medium having computer-usable program code embodied in the medium or a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.

The software aspects of the present invention may be stored, implemented and/or distributed on any suitable computer usable or computer readable medium(s). For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer program product aspects of the present invention may have computer usable or computer readable program code portions thereof, which are stored together or distributed, either spatially or temporally across one or more devices. The computer-usable or computer-readable medium may also comprise a computer network itself as the computer program product moves from buffer to buffer propagating through the network. As such, any physical memory associated with part of a network or network component can constitute a computer readable medium.

The program code may execute entirely on a single processing device, partly on one or more different processing devices, as a stand-alone software package or as part of a larger system, partly on a local processing device and partly on a remote processing device or entirely on the remote processing device. In the latter scenario, the remote processing device may be connected to the local processing device through a network such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external processing device, for example, through the Internet using an Internet Service Provider.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus systems and computer program products comprising a computer usable medium having computer usable program code embodied therewith, according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams may be implemented by system components or computer usable code that defines computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

This computer usable code may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer usable medium, such as a computer-readable memory, produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

Once a computer is programmed to implement the various aspects of the present invention, including the methods of use as set out herein, such computer, in effect, becomes a special purpose computer particular to the methods and program structures of this invention. The techniques necessary for this are well known to those skilled in the art of computer systems.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, one or more blocks in the flowchart or block diagrams may represent a component, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently or in the reverse order.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention.

Having thus described the invention of the present application in detail and by reference to embodiments thereof, it will be apparent that modifications and variations are possible without departing from the scope of the invention defined in the appended claims.

1. A computer program product to perform deep web mining operations comprising: a computer usable medium having computer usable program code embodied therewith, the computer usable program code comprising: computer usable program code configured to define a new task corresponding to a concept space associated with a topic of interest to a user; computer usable program code configured to obtain seed information with regard to the concept space including identifying at least one of an on-line form and at least one search term; computer usable program code configured to create at least one deep mining thread associated with the defined new task, wherein the deep web mining thread performs a mining process including: computer usable program code configured to define a plurality of content-service threads and crawler threads; computer usable program code configured to generate at least one query derived from keyword information within the corresponding task and/or terms obtained from analysis of crawled content; computer usable program code configured to queue the generated queries; computer usable program code configured to declare a specific implementation of an abstract forms-based query service in a corresponding content-service thread that executes a deep mining process by matching an identified on-line form to a corresponding form-understanding plug-in that understands the format of the on-line form, wherein the selected form-understanding plug-in simulates the submission of a query and identifies relevant result addresses; computer usable program code configured to queue query result addresses in a crawler queue; computer usable program code configured to asynchronously service each result address by a corresponding crawler thread that obtains content and/or metadata that is cached in a local storage medium; computer usable program code configured to process the content of the returned results; and computer usable program code configured to update a display with a listing of the mined results, wherein the user may browse a local navigable copy of the crawled results in isolation by selecting a navigable entry of the listing.
2. The computer program product according to claim 1, wherein the computer usable program code configured to process the content of the returned results comprises: computer usable program code configured to utilize a plurality of processors, each processor associated with a different returned file type.
3. The computer program product according to claim 1, wherein the computer usable program code configured to process the content of the returned results further comprises: computer usable program code configured to identify the text content of returned results; computer usable program code configured to perform a linguistic organization of the identified text; computer usable program code configured to identify new terms associated with the corresponding concept space; and computer usable program code configured to iteratively repeat the mining process until a predetermined stopping event is detected.
4. The computer program product according to claim 1, further comprising: computer usable program code configured to collapse the multiple crawler threads to a single thread after data retrieval is complete.
5. The computer program product according to claim 1, further comprising: computer usable program code configured to identify keyword generation parameters to control the manner in which query terms are generated as a result of analyzing crawled content.
6. The computer program product according to claim 1, further comprising: computer usable program code configured to set user parameters regarding cookie privacy policies used when mining content associated with the corresponding task.
7. The computer program product according to claim 1, further comprising: computer usable program code that allows a user to build a form-understanding plug-in that is usable by the computer usable program code configured to declare a specific implementation of an abstract forms-based query service in a corresponding content-service thread, comprising: computer usable program code configured to obtain a web site of interest; computer usable program code configured to retrieve a query page having a form for accessing the site's search engine; computer usable program code configured to recognize or obtain relevant form input(s); computer usable program code configured to generate or obtain example search term(s); computer usable program code configured to simulate entry of the form to submit a query to the search engine based on the example query term(s); computer usable program code configured to receive query results returned in response to submitting the query form to the search engine, the query results comprising at least one page of addresses to locations on the network having content responsive to the submitted query; computer usable program code configured to recognize or obtain result anchors of interest within the query results; computer usable program code configured to derive a pattern that distinguishes result anchors from non-result anchors; computer usable program code configured to recognize or obtain next page anchors of interest within the query results; computer usable program code configured to derive a pattern that distinguishes next page anchors from other anchors; and computer usable program code configured for persisting the resulting form-understanding plug-in for subsequent use by the deep web miner.
8. The computer program product according to claim 7, wherein the computer usable program code configured to derive a pattern to distinguish anchors of interest from others comprises: computer usable program code configured to recognize or obtain anchors of interest; computer usable program code configured to define a space of web page features to explore; computer usable program code configured to generate a series of one or more pattern instances within the web page feature space based on the anchors of interest; computer usable program code configured to iteratively search through the series of pattern instances to determine if the pattern matches one or more anchors present; and computer usable program code configured to accept a pattern if it matches only in the anchors of interest and does not match any other anchors.
9. The computer program product according to claim 8, wherein the computer usable program code configured to iteratively search through a series of pattern instances in a web page feature space proceeds from more general patterns to more specific patterns.
10. The computer program product according to claim 1, further comprising: computer usable program code configured to enable a user to create deep web mining form-understanding plug-ins comprising: computer usable program code configured to provide a library of routines for specifying the information needed to support deep web mining operations with a specific site's form processing engine, the library of routines enabling a user to build a form-understanding plug-in by identifying: a web site; relevant form inputs and submission requirement; and patterns to distinguish result anchors and next page anchors from other anchors.
11. A method of extracting information from a network comprising: executing a user interface on a computer for obtaining seed information from a user, where the seed information provides sufficient information to define a concept of interest to the user; identifying a search engine to utilize for performing deep web mining; mapping the seed information provided by the user to query terms suitable for use with the identified search engine; performing an iterative mining process until a stopping event is detected by: retrieving a query page having a form for accessing the search engine; simulating entry of the form to submit a query to the search engine based at least in part, upon the derived query terms; receiving query results returned in response to submitting the query form to the search engine, the query results comprising at least one page of addresses to locations on the network having content responsive to the submitted query; identifying addresses of interest from the query results for further processing; crawling the network to obtain content from the identified addresses of interest; building a local, navigable copy of the content obtained from crawling the network in a local storage device such that links within the content are limited to the local copy itself and do not function if the link contents were not captured by the corresponding mining process; analyzing the resulting content returned from crawling the network; generating at least one new content based query term based upon analyzing the search results; updating the query terms based upon at least one new content-based query term; dynamically conveying the results of processing to the user such that the user can interact with a dynamically changing local navigable environment while the mining process is iterating; and dynamically reconfiguring the iterative mining process based upon user interaction, while the mining process is iterating.
12. The method of claim 11, wherein obtaining seed information comprises: obtaining seed information from the user that defines at least one of a query term pertaining to the concept of interest and a name or address of the identified search engine.
13. The method of claim 11, further comprising: defining the stopping event as a user imposed link exploration restraint based upon at least one of a total number of links, a link depth or a relevance of search results; and overriding user defined depth constraints if query constraints are satisfied.
14. The method of claim 11, wherein identifying addresses of interest from the query results for further processing comprises: distinguishing relevant result addresses from non-result addresses present in query result pages; and constraining the addresses of interest to a super-domain of the search engine.
15. The method according to claim 11, wherein crawling the network to obtain content from the identified addresses of interest comprises: constraining crawled pages to at least one of a domain of the search engine, a super-domain of the search engine, the domain of corresponding results pages or any number of segments of the domain of the corresponding results pages; and performing link exploration by identifying addresses contained in obtained documents including HTML and non-HTML documents.
16. The method according to claim 11, further comprising: maintaining a plurality of tasks where each task corresponds to a search implemented in response to a user initiated search request that can be saved, re-started or re-initialized; and creating a plurality of crawler and content service threads for a corresponding task, wherein each thread maintains its own cookie space for storing cookies of visited network locations that utilize cookies.
17. The method according to claim 11, wherein generating at least one new content based query term based upon analyzing the search results comprises at least one of: generating a paired query by narrowing an existing query with at least one additional conjunctive term that is determined to be weakly correlated with existing primary terms and allowing the user to control the breadth of a mining process by controlling how closely concepts in the paired query must relate to a corresponding primary concept; and generating a chained query by replacing a primary query with alternative terms that have been determined to be strongly correlated with all primary terms.
18. The method according to claim 11, wherein simulating entry of the form to submit a query to the search engine based at least in part, upon the derived query terms comprises: matching an identified on-line form to a corresponding form-understanding plug-in that understands the format of the on-line form, wherein the selected form-understanding plug-in simulates the submission of a query and identifies relevant result addresses.
19. The method according to claim 18, further comprising enabling a user to build a form-understanding plug-in comprising: obtaining a web site of interest; retrieving a query page having a form for accessing the site's search engine; recognizing or obtaining relevant form input(s); generating or obtaining example search term(s); simulating entry of the form to submit a query to the search engine based on the example query term(s); receiving query results returned in response to submitting the query form to the search engine, the query results comprising at least one page of addresses to locations on the network having content responsive to the submitted query; recognizing or obtaining result anchors of interest within the query results; deriving a pattern that distinguishes result anchors from non-result anchors; recognizing or obtaining next page anchors of interest within the query results; deriving a pattern that distinguishes next page anchors from other anchors; and persisting the resulting form-understanding plug-in for subsequent use by the deep web miner.
20. The method according to claim 18, wherein deriving a pattern to distinguish anchors of interest from others comprises: obtaining anchors of interest from the user; defining a space of web page features to explore; generating a series of one or more pattern instances within the web page feature space based on the anchors of interest; iteratively searching through the series of pattern instances to determine if the pattern matches one or more anchors present by proceeding from more general patterns to more specific patterns; and accepting a pattern if it matches only in the anchors of interest and does not match any other anchors.